Preface


In Europe it has long been the practice to publish small books
known as manuals, giving a condensed and succinct treatment of the
subject as an aid to the student in summarizing the more elaborate
and detailed contents of a large textbook. Halfway between an article
in an encyclopedia and the often discursive texts studied in the
classroom, these manuals fulfil a real need; they winnow fundamental
principles from a mass of material, give with incisive clarity the
contours of the ground already covered and leave with the reader a
definite framework which later he may fill in by further practical
contact with the subject in real life. In America there has been a
noticeable gap in educational literature which now happily will be
filled in part by the excellent series of small books published by
Barnes & Noble, Inc.

The present volume on statistics does not aim to be a comprehensive
treatise on the subject. On the contrary it gives the distilled
essence of material which might well require one or more large volumes
for a full discussion. For that very reason it ought to be a most useful
tool in the hands alike of students and people actually engaged in
statistical work, wherever the particular field of activity may happen
to lie. The formulas and examples given in it will be ample for the
needs of most workers, whether they be concerned with financial,
industrial, commercial, social, or educational statistics. Of necessity
there is a minimum of verbiage; no formula is included that has not
practical applications; the mathematical aspects are not stressed; the
philosophy of the subject and many recondite byways are not explored.
Thus the reader is spared the need of hunting back and forth to find the
special help which he needs on his concrete problems in daily life. To
all statistical workers this little volume will be as indispensable as
an adding machine.


JUSTIN H. MOORE, Professor of Law 
School of Business and Civic Administration 
The College of the City of New York
CHAPTER I
STATISTICAL SERIES 


Definition of Statistical Method

Statistical Method is a technique used to obtain, analyze and
present numerical data.

Elements of Statistical Technique

The elements of statistical technique include the:

1. Collection and assembling of data.

2. Classification and condensation of data.

3. Presentation of data in:

a. textual form.

b. tabular form. 

c. graphic form. 

4. Analysis of data. 


Characteristics and Limitations of Statistical Methods

1. Statistical method is the only means for handling large
masses of numerical data.

2. Statistical technique applies only to data which are reducible
to quantitative form.

3. Statistical technique is objective. The results, however,
cannot but be affected by the necessarily subjective interpretation.

4. Statistical technique is the same for the social as for the
physical sciences; i.e., economics, education, sociology and
psychology as contrasted with biology, chemistry and astronomy.
Method and technique apply alike to these divergent fields.

Statistical Series

In order to analyze numerical data, it is first necessary to
arrange them systematically. The data may be arranged in a
number of different ways. Technically an arrangement is called
a distribution or series. An example of each type of
distribution is shown below:


When data are grouped according to:    The resulting series is called a:

1. Magnitude                           Frequency distribution
2. Time of occurrence                  Time series
3. Geographic location                 Spatial distribution


STATISTICAL METHODS


In addition there is a number of special types of distributions 
in which the data may be arranged by kind or by degree. 

The Frequency Distribution 
Definition

The frequency distribution is an arrangement of numerical
data according to size or magnitude.

Construction

A frequency distribution is constructed in the following
manner:

1. Using the range of the data (the interval between the highest
and the lowest figure) as a guide, the data are divided into
a number of convenient sized groups. The groups are called
class intervals (compare table 1, column 1).¹

The size of the class interval is dependent upon the number of 
values to be included in the distribution. The range of the values 
(difference between the highest and lowest values) is determined 
and is divided by the number of class intervals desired. The 
resulting size is rounded off. Few class intervals are used when a 
limited number of values are included and a large number when 
the distribution is to be compiled from many values. The most 
efficient number of class intervals usually lies between ten and 
twenty groups. 

Other requirements for the determination of the class interval are:

a. The class intervals should not overlap;

0-4.9, 5-9.9, etc., should be used in preference to
0-5, 5-10, etc.

b. When the values tabulated coincide with the integers or
with selected values, these values should generally constitute
the midpoints of the groups.

c. When possible the class intervals should be of a uniform size.

2. The groups are then placed in a column with the lowest 
class interval at the top and the rest of the class intervals 
following according to size. 

3. The data are then scored. Each figure is checked once next
to the class interval into which it falls (see tally, table 1).²
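
The construction steps above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_distribution`, its parameters, and the sample rates are assumptions made for the example and are not taken from table 1.

```python
import math

def frequency_distribution(values, width, start=None):
    """Tally values into non-overlapping class intervals of uniform width.

    Returns rows of (lower limit, exclusive upper limit, frequency),
    with the lowest class interval first.
    """
    if start is None:
        # Begin the first class at a multiple of the width at or below the minimum.
        start = math.floor(min(values) / width) * width
    n_classes = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_classes
    for v in values:
        # Each figure is scored once, in the class whose limits contain it.
        counts[math.floor((v - start) / width)] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical tax rates (dollars per thousand); not the data of table 1.
rates = [5.2, 9.8, 12.0, 13.5, 15.99, 16.0, 18.4, 21.7, 22.3, 30.1]
table = frequency_distribution(rates, width=4, start=4)
for low, high, f in table:
    print(f"{low:>2} - {high - 0.01:5.2f}  {f}")
```

Because each class runs from its lower limit up to, but not including, the next lower limit, no value can fall into two classes, which is the non-overlap requirement stated in (a) above.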

Graphic Presentation of Frequency Distribution 

If two lines are drawn perpendicular to one another and are
divided according to a scale of values, given data may be
represented by reference to the scale. The horizontal line is known as
the X axis and the vertical line as the Y axis. If the values for

¹ As a preliminary step the raw data may be arranged according to size. The series is then
called an array.

² In scoring data an efficient procedure is to connect the first and fourth scores by the fifth.
Using this procedure totals are obtained merely by adding the resulting units, multiplying by
five and adding the odd scores (see Table 1). A tally of this type saves time in counting
frequencies and also eliminates possible inaccuracies.
Table 1 — Score Sheet

City Tax Rate on "True" Valuation in 261 Cities in the United States, 1927

Rate Per            Total Number
Thousand Dollars    of Cities
(Class Intervals)   (Frequency)    Tally

 4 -  7.99             5           ||||/
 8 - 11.99            15           ||||/ ||||/ ||||/
12 - 15.99            46           ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ |
16 - 19.99            68           ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ |||
20 - 23.99            58           ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ |||
24 - 27.99            32           ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||
28 - 31.99            22           ||||/ ||||/ ||||/ ||||/ ||
32 - 35.99            10           ||||/ ||||/
36 - 39.99             2           ||
40 - 43.99             2           ||
44 - 47.99             0
48 - 51.99             1           |

                     261

(In the tally, ||||/ stands for four strokes joined by a fifth.)

Source: United States Department of Commerce, Financial Statistics of Cities,
1927, Table 23.


a point are given the point may then be located on the graph.
For example, in figure 1 the point X = 2, Y = 3 may now be located
at the point marked a.



Fig. 1 — Location of Plotted Point X = 2, Y = 3.
If the two axes are now marked in the units of the given data
the frequency distribution may be presented graphically
(pictorially).

1. The class interval grouping which will be termed the
independent variable is placed on the X (horizontal) axis and
the frequency or dependent variable is placed on the Y
(vertical) axis.¹

2. The number of cases (frequency) is plotted at the midpoint
of each respective class interval at the appropriate horizontal
level as indicated by the scale on the Y axis.²

3. When connected the plotted points form a frequency polygon.

4. Rectangles may be constructed by using as the width the
size of the class interval and as the height the frequency in
each class interval. The rectangles form a histogram (also
known as a rectangular frequency polygon).
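
As a rough sketch of steps 2 to 4, the plotting coordinates can be computed before any drawing is done. The class limits and frequencies are those of the tax-rate distribution (as listed in table 3); the variable names are assumptions made for this example.

```python
# Lower class limits and frequencies of the tax-rate distribution (table 3).
lower_limits = [4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48]
frequencies = [5, 15, 46, 68, 58, 32, 22, 10, 2, 2, 0, 1]
width = 4  # size of the class interval

# Frequency polygon: one point per class, plotted at the class midpoint.
polygon_points = [(low + width / 2, f)
                  for low, f in zip(lower_limits, frequencies)]

# Histogram: one rectangle per class, as (left edge, width, height).
rectangles = [(low, width, f) for low, f in zip(lower_limits, frequencies)]

print(polygon_points[:3])
print(rectangles[:3])
```

Either list can then be handed to whatever plotting tool is at hand; the polygon joins the points in order, while the histogram draws each rectangle from the X axis up to its frequency.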



Fig. 2 — Distribution of Tax Rates on "True" Valuation
for 261 Cities in the United States, 1927.


Cumulative Frequency Distribution

A distribution in which the frequencies are cumulated is known
as an ogive. Examples of the ogive are shown in tables 2a and 2b.

¹ The distance along the X axis is called the abscissa while that along the Y axis is called
the ordinate.

² A sectional background or ruling may be used to aid in this work.
Table 2a — "Less Than" Ogive

Distribution of Wholesale Sales by Size of Firm for the United States, 1930


Size of Firm                         Number of
(Thousands of Dollars of Sales)      Firms

Less than 25                         14,235
Less than 50                         23,568
Less than 100                        36,154
Less than 200                        49,667
Less than 300                        57,081
Less than 400                        61,701
Less than 500                        64,716
Less than 1,000                      71,453
Less than 25,000                     76,600


Source: United States Department of Commerce, 1930 Census.

In this table a "less than" ogive is used. This distribution may
easily be converted into an "and over" ogive by cumulating the
items on an "and over" basis as in table 2b.

Table 2b — Distribution of Farms in New England by Size, 1930


Size in Acres        Number of Farms

0 and over           124,925
20 and over          104,948
50 and over           84,664
100 and over          54,924
175 and over          24,628
260 and over          10,494
500 and over           2,165
1000 and over            392
5000 and over              0


Source: United States Department of Commerce, 1930 Census.
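
The conversion from a "less than" to an "and over" basis can be sketched as follows, using the wholesale figures of table 2a. The sketch assumes that the last cumulative figure (76,600) is the total number of firms and that the third entry reads 36,154; the variable names are illustrative only.

```python
# "Less than" cumulative frequencies from table 2a:
# (upper limit in thousands of dollars, number of firms below that limit).
less_than = [(25, 14235), (50, 23568), (100, 36154), (200, 49667),
             (300, 57081), (400, 61701), (500, 64716), (1000, 71453),
             (25000, 76600)]

total = less_than[-1][1]  # assumed to cover every firm

# Frequency of each separate class, recovered by differencing.
class_frequencies = []
previous = 0
for limit, cumulative in less_than:
    class_frequencies.append(cumulative - previous)
    previous = cumulative

# "And over" ogive: firms at or above each limit.
and_over = [(0, total)] + [(limit, total - cumulative)
                           for limit, cumulative in less_than]

for limit, count in and_over:
    print(f"{limit:>6} and over  {count:>6}")
```

The two ogives carry the same information: at every limit the "less than" and "and over" figures add up to the total.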

Analysis

Unless a mass of data is grouped it is unwieldy and in many
cases impossible to analyze. In a frequency distribution the data
are arranged so that with the application of further techniques
analysis of the data is made possible. In itself the mere grouping
of data does not present an analysis.

Types of Frequency Distributions

The more usual types of distributions are shown below. In
addition to these there are a number of unusual types such as the
bimodal curve (two peaks), the "j" and inverted "j" curves
(curves shaped to resemble a "j"), etc.

1. Symmetrical distribution: The "normal" curve is the best
known example of a symmetrical distribution.

2. Skewed distribution: Most frequency distributions extend 
further in one direction than in the other. This type is 
known as a skewed distribution. It is, of course, identified 
by a lack of symmetry. 

a. The right (positively) skewed distribution is caused by 
the extremes in the higher values distorting the curve 
towards the right. 
b. The left (negatively) skewed distribution, a less common
type, is caused by extremes in the lower values which 
distort the curve towards the left. 



Fig. 5 — Hypothetical Left Skewed Distribution of Wages in a Factory.


Characteristics of the Frequency Distribution*

Natural, economic and sociological data show a distinct
tendency to group about a given point.

This grouping tendency gives rise to a peak which always occurs
in frequency distributions. The location of the peak or point of
central or common tendency is one of the characteristics which may
be measured. In the graph below the two distributions are
identical in nature except that the points of central tendency are
located at different positions on the scale.


The tendency of a group of values to cluster about a central 
point makes possible the use of a typical value to describe the mass 
of data. The location of this point of central tendency may be 
measured by the average. 


* Frequency distributions are commonly divided into two types, continuous and discrete
(non-continuous). In the continuous series it is possible to have every size of the variables
between given limits, i.e., in the distribution of weights, or of ages of individuals where any size
is possible. In the discrete series only limited gradations are possible, as in, for instance, the
distribution of numbers of pupils in public schools in the United States. Fractions of a unit
cannot appear in this series.

Fig. 6 — Hypothetical Distribution of Wages in Two Factories (horizontal axis: wages in dollars).


Dispersion

In figure 7 the distributions are identical in character. The 
values of the items included in curve a, however, vary to a greater 
degree than curve b. The degree of variation differs from curve 
to curve and is known as dispersion. Dispersion, then, may be 
defined as the variation in size occurring among the various items 
constituting the series. 



Fig. 7 — Hypothetical Distribution of Wages in Two Factories.

Skewness

The distributions in figure 8 differ in that curve a is
symmetrical while curve b is not. The lack of symmetry is known as
skewness.



Fig. 8 — Hypothetical Distribution of Wages in Two Factories.


Kurtosis

The two curves in figure 9 differ in their degree of "peakedness."
This characteristic is known as kurtosis.

CHAPTER II


ANALYSIS OF THE FREQUENCY DISTRIBUTION 
CENTRAL TENDENCY-ARITHMETIC MEAN 

Measures of Common Tendency-Averages 

An average is a typical value which tends to sum up or describe 
the mass of data. It also serves as a basis for measuring or 
evaluating extreme or unusual values. The average is a measure 
of the location of central tendency. 

Kinds of Averages 

The most important kinds of averages are:

1. The arithmetic mean (Discussed in Chap. II)

2. The median (Discussed in Chap. III)

3. The mode

4. The geometric mean

5. The quadratic mean

The Arithmetic Mean 

Due to ease of computation and long usage the arithmetic 
mean is the best known and most commonly used of all the 
averages. 

Methods of Calculating the Arithmetic Mean

Ungrouped Data

The arithmetic mean of a small group of individual items may 
be obtained by adding all items together and dividing the total 
by the number of items used. 

The computation of the arithmetic mean is expressed in formula
form as:¹

\[ \bar{X} = \frac{\Sigma X}{N} \]

where

\( \bar{X} \) = arithmetic mean
\( \Sigma \) = symbol meaning "sum of"²
\( X \) = data expressed as individual items
\( N \) = number of items

¹ This formula is also given in various textbooks, using different symbols, as:

(a) \( M = \frac{\Sigma (X)}{N} \)

(b) \( M = \frac{\Sigma (m)}{N} \)

² The symbol \( \Sigma \) is a Greek capital letter sigma.




Grouped Data

Where the arithmetic mean of a considerable number of items 
is to be computed the method outlined above is generally too 
laborious and too subject to error. With increasing numbers of 
items the simple problem of addition can become difficult to the 
point of physical impossibility. For instance, if the arithmetic 
mean is to be applied to data containing forty or fifty thousand 
items, correct addition of the huge mass of numbers is next to 
impossible even with the aid of an adding machine. 

A more convenient and efficient method is to group the data
into the form of a frequency distribution, and then compute the
arithmetic mean for the distribution.

The arithmetic mean may be computed from the distribution
in table 3 by assuming that all the values included within the limits
of a class interval are distributed evenly in it¹ and therefore the
average value of all the cases in each class interval coincides with
the midpoint.

Long Method 

Since no knowledge is available of the actual distribution of the 
cases within each group, it may be assumed that the cases are dis- 
tributed evenly between the limits of the group. This would result 
in an average value for all values in the group equal to the mid- 
point of the group. Thus, the total value for each group may be 
obtained by multiplying the mid-point of the group by the number 
of cases in the group. (See table 3). 

Thus for the frequency distribution in table 3 the midpoint 

Table 3 — Computation of Arithmetic Mean by Long Method —
Grouped Data

City Tax Rate on "True" Valuation in 261 Cities in the United States, 1927


(1)                    (2)           (3)            (4)
Rates Per Thousand                   Number of
Dollars (In Dollars)                 Cities
Class                  Midpoint      Frequency      Frequency × Midpoint
Interval               (M. P.)       (f)            (f) × (M. P.)

$4- 7.99               $6              5               30
 8-11.99               10             15              150
12-15.99               14             46              644
16-19.99               18             68             1224
20-23.99               22             58             1276
24-27.99               26             32              832
28-31.99               30             22              660
32-35.99               34             10              340
36-39.99               38              2               76
40-43.99               42              2               84
44-47.99               46              0                0
48-51.99               50              1               50

                                     261             5366


Source: United States Department of Commerce, Financial Statistics of
Cities, 1927, Table 23.

¹ If the class interval is not too large and a sufficient number of cases are available, deviation
from the assumption will be small and an accurate result will be obtained.





of the first class interval ($6) is multiplied by the frequency
indicated for that group (5) in order to obtain the total value for all
cases in the class interval. The totals (table 3, column 4) are then
added to obtain the total value of all cases in the frequency
distribution. The sum is divided by the number of cases (N) to obtain
the arithmetic mean.

This method may be generalized into formula form as:

\[ \bar{X} = \frac{\Sigma (f \times \text{M.P.})}{N} \]

The above is known as the long method because of the complex 
calculations which may result when the frequencies and the mid- 
point values are large. 
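
A short sketch of the long method applied to the figures of table 3 follows. The variable names are assumptions made for the example; the midpoints and frequencies are those of the table.

```python
# Midpoints and frequencies from table 3 (tax rates in 261 cities).
midpoints = [6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50]
frequencies = [5, 15, 46, 68, 58, 32, 22, 10, 2, 2, 0, 1]

# Column 4 of the table: frequency times midpoint for each class.
products = [f * m for f, m in zip(frequencies, midpoints)]

n = sum(frequencies)   # number of cases, N
total = sum(products)  # total value of all cases
mean = total / n       # arithmetic mean by the long method

print(n, total, round(mean, 2))
```

On these figures the sum of column 4 is 5366 and N is 261, so the mean tax rate comes to roughly $20.56 per thousand dollars.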

Short Method 

a. Unit Deviation Method

A simpler method may be devised by an examination of the
characteristics of the arithmetic mean. If the mean is computed
for a number of individual items (as shown below) and the
deviation (distance) of each of the items from the mean is obtained,
these deviations will total up to zero.*


Grades of Ten Students on an Examination in Arithmetic

Student      Grade          Deviation from Mean
Number       (Per Cent)     (x)

 1           95%             15%
 2           92              12%
 3           90              10%
 4           86               6%
 5           86               6%
 6           80               0%
 7           75             - 5%
 8           72             - 8%
 9           64             -16%
10           60             -20%

Total        800%             0


Mean \( \bar{X} = \frac{800\%}{10} = 80\% \)



If, however, some point other than the true arithmetic mean is
selected the sum of the deviations from this point will not be
zero. In the same series, for instance, 90% may be selected as
an arbitrary starting point known technically as the Guessed
Mean, and identified by the symbol Z.

* The letter x is assigned as the symbol for deviation from the arithmetic mean.
Student      Grade          Deviation from
Number       (Per Cent)     Guessed Mean (d)

 1           95%            + 5%
 2           92             + 2%
 3           90               0%
 4           86             - 4%
 5           86             - 4%
 6           80             -10%
 7           75             -15%
 8           72             -18%
 9           64             -26%
10           60             -30%

Total        800%           -100%


If the average deviation of each item from the guessed mean
is obtained:

\[ \frac{\Sigma (d)}{N} = \frac{-100\%}{10} = -10\% \text{ (average deviation)} \]

and if this value is added to the arbitrary starting point (Z) the
result will be the arithmetic mean.*

\[ \bar{X} = Z + \frac{\Sigma (d)}{N} = 90\% + \frac{(-100\%)}{10} = 80\% \]

Where

Z = guessed mean
d = deviation of each value from guessed mean
N = number of cases

Table 4 — Computation of Arithmetic Mean — Short-Unit Deviation Method

Ratio of Current Assets to Current Liabilities for 221 Industrial Corporations in
the United States, 1930


(1)          (2)          (3)            (4)            (5)
Ratios                    Number of
(Class                    Companies                     (Frequency ×
Interval)    (Midpoint)   (Frequency)    (Deviation)    Deviation)
             (M. P.)      (f)            (d)            (fd)

 0- 1.99       1           11             - 4            - 44
 2- 3.99       3           53             - 2            -106
 4- 5.99       5           47               0               0
 6- 7.99       7           37               2              74
 8- 9.99       9           21               4              84
10-11.99      11           16               6              96
12-13.99      13           13               8             104
14-15.99      15            8              10              80
16-17.99      17           10              12             120
18-19.99      19            1              14              14
20-21.99      21            2              16              32
22-23.99      23            1              18              18
24-25.99      25            0              20               0
26-27.99      27            1              22              22

                          221                             494

Source: Moody's Investors Service, Moody's Industrials, 1931.


* See technical appendix I for mathematical proof.
This technique may readily be applied to grouped data. For
the distribution above (table 4) an arbitrary starting point (guessed
mean) may be selected. Though any value may be taken, for
convenience the midpoint of one of the class intervals is generally
used. Here 5.00 (midpoint of third class interval, table 4) may be
used as the guessed mean.

The deviation from the arbitrary mean of the items in each
group may then be obtained by getting the difference between
the midpoint of each class interval and the guessed mean. Since
the midpoint of the group is the average value of all items in the
group this value (d) will represent the average deviation of the items
in the group from the assumed mean. To obtain the total
deviation for all items in the class interval it is necessary to multiply
this deviation (d) by the frequency of the group (f). This is
totaled for all class intervals to obtain the total deviation from the
guessed mean of all values, and is then divided by N to obtain the
average deviation about the guessed mean resulting in:


\[ \frac{\Sigma (fd)}{N} = \frac{494}{221} = 2.24 \]


The above value is now added to the arbitrary starting point
(guessed mean) to obtain the true arithmetic mean:*

\[ \bar{X} = Z + \frac{\Sigma (fd)}{N} = 5.00 + \frac{494}{221} = 7.24 \]

Where

Z = guessed mean
f = frequency of each class interval
d = deviation of midpoint of each group from guessed mean
N = total number of cases
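
The unit deviation computation for table 4 can be sketched as below, with a cross-check against the long method. The variable names are assumptions made for the example; the midpoints and frequencies are those of the table.

```python
import math

# Midpoints and frequencies from table 4 (ratios for 221 corporations).
midpoints = list(range(1, 28, 2))  # 1, 3, 5, ..., 27
frequencies = [11, 53, 47, 37, 21, 16, 13, 8, 10, 1, 2, 1, 0, 1]
guessed_mean = 5.00                # Z, midpoint of the third class interval

n = sum(frequencies)
# Sum of f * d, where d is the deviation of each midpoint from Z.
total_deviation = sum(f * (m - guessed_mean)
                      for m, f in zip(midpoints, frequencies))
short_mean = guessed_mean + total_deviation / n

# Cross-check: the long method gives the same arithmetic mean.
long_mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n

print(total_deviation, round(short_mean, 2), math.isclose(short_mean, long_mean))
```

The agreement of the two results illustrates that the short method is only a rearrangement of the arithmetic, not a different average.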

b. Group Deviation Method

The computation of the arithmetic mean from a frequency 
distribution may be further simplified after consideration of the 
characteristics of such a distribution. 

If the distribution has class intervals of a uniform size¹ it will
be noted that the deviation of the midpoint of one group from the
next will always be constant and equal to the size of the class
interval. In the distribution shown in table 4 there is a constant
difference of 2 between the midpoints, and this is equal to the size
of the class intervals, for example 0 to 1.99.

The deviations of any one group from any other group may then
be measured in terms of class intervals. In the distribution below

* See technical Appendix I for mathematical derivation of this formula.

¹ Wherever possible this should be the rule in compiling such a distribution.
(table 5) the midpoint of the 3rd class interval (5.00) has again
been selected as the arbitrary or guessed mean (Z). The midpoint
of the first class interval deviates from the guessed mean by the
amount of -4, or -2 class intervals. The deviation column
(table 5, column 4) is expressed in terms of class intervals rather
than in the original units of the data. The resulting values in the
deviation column are numerically smaller and thus the
computation is simpler.


Table 5 — Computation of Arithmetic Mean — Short-Group Deviation Method

Ratio of Current Assets to Current Liabilities for 221 Industrial Corporations in
the United States, 1930


(1)          (2)          (3)            (4)            (5)
Ratio                     Number of      (Deviation
(Class                    Companies      in Class
Interval)    (Midpoint)   (Frequency)    Intervals)
             (M. P.)      (f)            (d')           (fd')

 0- 1.99       1           11             - 2            - 22
 2- 3.99       3           53             - 1            - 53
 4- 5.99       5           47               0               0
 6- 7.99       7           37               1              37
 8- 9.99       9           21               2              42
10-11.99      11           16               3              48
12-13.99      13           13               4              52
14-15.99      15            8               5              40
16-17.99      17           10               6              60
18-19.99      19            1               7               7
20-21.99      21            2               8              16
22-23.99      23            1               9               9
24-25.99      25            0              10               0
26-27.99      27            1              11              11

                          221                             247


Source: Moody's Investors Service, Moody's Industrials, 1931.


\[ \bar{X} = Z + \frac{\Sigma (fd')}{N} C = 5.00 + \frac{247}{221} (2) = 7.24 \]


As in the previous method the computation is then carried out
in the same manner and results in an average deviation about the
guessed mean, \( \frac{\Sigma (fd')}{N} \), in terms of class intervals. To convert
this back to original values it is multiplied by the size of the class
interval. The result may then be added to the guessed mean to
obtain the arithmetic mean.

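This method may be generalized into formula form as follows:

\[ \bar{X} = Z + \frac{\Sigma (fd')}{N} C \]

Where*

Z = guessed mean
f = frequency
d' = deviation of midpoint of each group from guessed mean, in class intervals
N = total number of cases
C = size of class interval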
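
A minimal sketch of the group deviation method follows. The function name `group_deviation_mean` and its parameters are assumptions made for this example; the data are those of table 5.

```python
import math

def group_deviation_mean(midpoints, frequencies, guessed_mean, width):
    """Arithmetic mean by the group deviation method.

    Deviations are counted in whole class intervals from the guessed
    mean, which must itself be one of the class midpoints.
    Returns (mean, sum of f * d').
    """
    n = sum(frequencies)
    total = 0
    for m, f in zip(midpoints, frequencies):
        steps, remainder = divmod(m - guessed_mean, width)
        if remainder != 0:
            raise ValueError("guessed mean must lie a whole number of "
                             "class intervals from every midpoint")
        total += f * steps
    # Convert the average deviation back to original units before adding Z.
    return guessed_mean + (total / n) * width, total

midpoints = list(range(1, 28, 2))  # 1, 3, 5, ..., 27
frequencies = [11, 53, 47, 37, 21, 16, 13, 8, 10, 1, 2, 1, 0, 1]

mean_a, total_a = group_deviation_mean(midpoints, frequencies, 5, 2)
mean_b, total_b = group_deviation_mean(midpoints, frequencies, 11, 2)
print(total_a, round(mean_a, 2), total_b, round(mean_b, 2))
```

Running the same data with a different guessed mean (11 instead of 5) changes the deviation total but not the resulting mean, which is why the choice of starting point is a matter of convenience.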


* This formula is also variously given by different texts (the alternative forms are illegible in the source).



FREQUENCY DISTRIBUTION AVERAGES 




The Mode 

Definition 

The mode is the most frequent or most common value, provided
that a sufficiently large number of items are available to give
a smooth distribution.

The value of the mode will correspond to the value of the
maximum point (ordinate) of a frequency distribution if it is an "ideal"
or smooth distribution.

Computation 

It is not possible to make an exact mathematical determination of
the mode. A number of methods may be used, however, to secure
reasonably accurate approximations.

The midpoint of the modal class interval may not be used as 
the value of the mode, since its value will change if the size of 
the class interval is changed. 

Reducing the size of the class interval will tend to delimit
the value of the mode and tend more and more to have it coincide
with the midpoint of the group of greatest frequency. This
reduction in size of the class interval is, however, decidedly limited by
the number of items included in the distribution. If an infinite
number of items are available and an infinitely small class interval
is used, the midpoint of the class interval of greatest frequency
will be the value of the mode.

In practice this ideal situation does not exist, so that an
approximation somewhat closer than the midpoint of the modal group is
necessary.

In spite of the previous midpoint assumption the values within a
group are not evenly distributed in a distribution, but there is a
tendency to gravitate towards the point of greatest density.

In the distribution below (table 7) the modal group is that
containing 43 items, with class limits of .10% to .19%. Since
there are a greater number of cases in the class above, .20% to
.29%, with a frequency of 32, than that below,¹ .00% to .09%,
which contains 19 items, it follows that the true point of greatest
concentration will tend towards the upper class interval and will
therefore be above the midpoint of the modal group.

The value of the mode may be approximated by resort to the
formula:

\[ \text{Mode} = L_{mo} + \frac{f_a}{f_a + f_b} C = .10\% + \frac{32}{32 + 19} (.10\%) = .163\% \]

where

\( L_{mo} \) = lower limit of modal group
\( f_a \) = frequency of class interval above modal group
\( f_b \) = frequency of class interval below modal group
\( C \) = size of class interval
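
The formula can be sketched in Python as follows, using the frequencies of table 7. The function name `mode_moments_of_force` is an assumption made for the example, and the sketch assumes the modal class is neither the first nor the last class.

```python
def mode_moments_of_force(lower_limits, frequencies, width):
    """Approximate the mode from the modal class and its two neighbors."""
    # Index of the modal class (the class of greatest frequency).
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    if i == 0 or i == len(frequencies) - 1:
        raise ValueError("modal class needs a neighbor on each side")
    f_above = frequencies[i + 1]  # class just above the modal class
    f_below = frequencies[i - 1]  # class just below the modal class
    return lower_limits[i] + f_above / (f_above + f_below) * width

# Lower class limits (per cent) and frequencies from table 7.
lower_limits = [round(0.10 * k, 2) for k in range(13)]  # .00, .10, ..., 1.20
frequencies = [19, 43, 32, 27, 17, 21, 14, 9, 2, 2, 0, 0, 1]

mode = mode_moments_of_force(lower_limits, frequencies, width=0.10)
print(round(mode, 3))
```

The heavier neighbor pulls the estimate toward its side: with 32 cases above and 19 below, the mode falls above the midpoint (.15%) of the modal class.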

¹ By below in reference to a class interval is meant in the direction of the lowest class interval
value.
Table 7 — Computation of Mode, Moments of Force Method
Percent of Population on Old Age Pensions in the United States,
by Counties, 1930

(Class Intervals)      Number of Counties
Percent                (Frequency)

 .00- .09%              19
 .10- .19               43
 .20- .29               32
 .30- .39               27
 .40- .49               17
 .50- .59               21
 .60- .69               14
 .70- .79                9
 .80- .89                2
 .90- .99                2
1.00-1.09                0
1.10-1.19                0
1.20-1.29                1

                       187

Source: United States Bureau of Labor Statistics, Handbook of Labor
Statistics, Bulletin 451, 1931, pp. 483-487.

The above technique is sometimes referred to as the moments 
of force method. 

Empirical Method 

Where the distribution is only moderately skewed another esti- 
mation of the value of the mode may be obtained from the rela- 
tionship that exists between the position of the mean, median 
and mode. 

In a smoothed curve (such as that shown in figure 10) the mode
will be located at the highest point in the distribution, the
position of the median will be somewhat to the right of the mode in
the direction of the extreme values and will divide the area under
the curve in half. The mean, since it is affected to the greatest
degree by extreme values, will be furthest in the direction of the
extreme values in this hypothetical right skewed distribution.

It has been found that for a moderately skewed distribution 
the distance between the mean and the median is one-third of the 
distance between the mean and the mode. 

In a left skewed distribution the same relationship will occur 
but in the opposite direction. 

Since the values of the mean and the median may be determined 
exactly, the value of the mode may be approximately determined 
through this relationship. 

Mode = Mean - 3 (Mean - Median)
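
The empirical relationship can be tried on a pair of values. The mean and median below are hypothetical figures chosen for the illustration, and the function name `empirical_mode` is likewise an assumption.

```python
def empirical_mode(mean, median):
    """Estimate the mode of a moderately skewed distribution."""
    return mean - 3 * (mean - median)

# Hypothetical values for a right skewed distribution.
mean, median = 52.0, 50.0
mode = empirical_mode(mean, median)

# The mean-to-median distance is one-third of the mean-to-mode distance.
print(mode, mean - median, mean - mode)
```

Since the mean lies to the right of the median here, the estimated mode falls to the left of both, as expected for a right skewed distribution.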

Other Methods

Estimates of the value of the mode may be determined by a
number of additional methods such as:*

* More advanced methods of determining the value of the mode are outlined in Chapter XIV.
Fig. 10 — Hypothetical right skewed frequency distribution showing
theoretical position of mode, median and mean. 

1. The grouping method.

2. By smoothing the frequency distribution.

3. By moving averages.

4. By mathematical curves.
Mode

Characteristics

1. By definition the mode is the most usual or typical value.
Under certain circumstances it may be considered as the "normal"
value.

2. The value of the mode is entirely independent of extreme 
items. 

3. The mode is an average of position. 

Advantages

1. It is the most typical and therefore the most descriptive 
average. 

2. It is simple to approximate by observation where there are 
a small number of cases. 

3. It is not necessary to arrange the values or know them if 
they are few in number. 

Disadvantages

1. The mode can be approximated only when a limited amount
of data is available.

2. Its significance is limited when a large number of values is 
not available. 

3. In a small number of items the mode may not exist, for none 
of the values may be repeated. 

The Geometric Mean 

The geometric mean is the nth root of the product of n items 
or values. 

Formula:

\[ G_m = \sqrt[n]{X_1 \cdot X_2 \cdots X_n} \]

For instance the geometric mean of $1, $3 and $9 is

\[ G_m = \sqrt[3]{1 \times 3 \times 9} = \sqrt[3]{27} = 3 \]


To facilitate the computation of the geometric mean, its formula
may be reduced to its logarithmic form.

\[ \log G_m = \frac{\log X_1 + \log X_2 + \cdots + \log X_n}{n} \]

where

\( G_m \) = Geometric Mean

It can be seen that the logarithm of the geometric mean is 
equal to the average of the logarithms of the items. 
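
The logarithmic form lends itself to a short sketch. The function name `geometric_mean` is an assumption made for the example; the items are the $1, $3 and $9 used above.

```python
import math

def geometric_mean(values):
    """Geometric mean computed as the antilog of the mean logarithm."""
    if any(v <= 0 for v in values):
        raise ValueError("all items must be positive")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

items = [1, 3, 9]
gm = geometric_mean(items)

# Direct form for comparison: the nth root of the product.
gm_direct = math.prod(items) ** (1 / len(items))

print(gm, gm_direct)
```

Both routes agree to within floating point rounding; the logarithmic route avoids forming a very large product when there are many items.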
The geometric mean may be computed from grouped data by
using the technique outlined for the computation of the arith- 
metic mean by the “long” method (see page 12), except that the 
logarithms of the midpoints are used in the calculations rather 
than the actual midpoint values.

Characteristics

1. The geometric mean is a calculated value and dependent
upon the size of all the values.

2. It is less affected by extreme items than the arithmetic mean.

3. For any series of positive items it is never larger than the
arithmetic mean, and it is smaller unless all the items are equal.

Advantages

1. It is a more typical average than the arithmetic mean since
it is less affected by extremes.

2. It may be manipulated algebraically.

3. It is particularly useful in the computation of index numbers
(see Chapter XIII).

Disadvantages

1. The geometric mean is not widely known. 

2. The geometric mean is relatively difficult to compute. 

3. It cannot be determined where there are negative values in 
the series or where one of the items is zero. 

The Quadratic Mean 

The quadratic mean is the square root of the mean square of 
the items (root-mean-square). 

Formula:

\[ \text{Quadratic mean} = \sqrt{\frac{\Sigma X^2}{N}} \]



The quadratic mean is used in the computation of the standard 
deviation (see page 34). 
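
The root-mean-square definition can be sketched directly. The function name `quadratic_mean` and the three sample values are assumptions chosen so that the arithmetic comes out evenly.

```python
import math

def quadratic_mean(values):
    """Square root of the mean of the squared items (root-mean-square)."""
    return math.sqrt(sum(v * v for v in values) / len(values))

values = [1, 5, 7]              # hypothetical items
qm = quadratic_mean(values)     # squares: 1 + 25 + 49 = 75; 75 / 3 = 25
am = sum(values) / len(values)  # arithmetic mean, for comparison

print(qm, am)
```

Because squaring gives extra weight to the larger items, the quadratic mean is never smaller than the arithmetic mean of the same items.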

The Harmonic Mean 

The harmonic mean of a series of values is the reciprocal of the 
arithmetic mean of the reciprocals of the values. 

Formula: 

\[ \frac{1}{H} = \frac{\dfrac{1}{X_1} + \dfrac{1}{X_2} + \dfrac{1}{X_3} + \cdots + \dfrac{1}{X_n}}{N} \]

The harmonic mean is used in averaging rates. 
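
A short Python sketch may make the use of the harmonic mean for rates more concrete. The speeds below are hypothetical figures chosen for illustration, and the function name `harmonic_mean` is an assumption of this sketch, not something taken from the text.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if any(v <= 0 for v in values):
        raise ValueError("all items must be positive")
    return len(values) / sum(1 / v for v in values)

# Hypothetical example: the same distance is covered once at 30 miles
# per hour and once at 60 miles per hour.
speeds = [30, 60]
h = harmonic_mean(speeds)
print(round(h, 6))
```

For equal distances the harmonic mean of the two speeds is the true average speed of the whole trip, which is lower than the arithmetic mean of 45.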
FREQUENCY DISTRIBUTION 
DISPERSION AND SKEWNESS 


Dispersion 

The average or typical value is of little use unless the degree of 
variation which occurs about it is given. 

If the scatter about the measure of central tendency is very 
large it is of little use as a typical value. It is therefore necessary 
to develop a quantitative measure of the dispersion (or variation 
or scatter) which occurs about the average. 

The Range 

The range, the simplest of the measures of dispersion, is the 
difference between the minimum and maximum items in the 
series. It is sometimes given in the form of a statement of the 
minimum and maximum values themselves. 

The difference between these two values gives some idea of the 
degree of variation occurring in the series, but quite frequently 
the result is misleading. 

In series A and B below the range is 30%, but the dispersion 
is not the same. 

Hypothetical Examination Grades for Ten Students 

                        Grade in Percent 
Student Number    Examination A    Examination B 
       1               60%              60% 
       2               60%              65% 
       3               61%              70% 
       4               63%              72% 
       5               65%              75% 
       6               65%              78% 
       7               66%              80% 
       8               67%              85% 
       9               68%              88% 
      10               90%              90% 


Characteristics 

1. The range is simple and readily understood. 

2. It is easily calculated. 

3. Its value is dependent on two items only, the highest and 
lowest values. 

4. The distribution of the items between the two extremes is 
not necessary to obtain the range. 

5. Since the range is dependent only upon the two extremes, 
it is greatly affected by unusual occurrences when these two 
items are distinctly “out of line.” 
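
The point made by the two examination series can be checked with a brief Python sketch. The grades are those of the hypothetical table above; the helper names and the use of the average absolute distance from the mean as a second yardstick are choices made for this illustration only.

```python
exam_a = [60, 60, 61, 63, 65, 65, 66, 67, 68, 90]
exam_b = [60, 65, 70, 72, 75, 78, 80, 85, 88, 90]

def value_range(values):
    """Difference between the maximum and minimum items."""
    return max(values) - min(values)

def average_distance_from_mean(values):
    """Average distance of the items from their arithmetic mean, signs ignored."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

range_a, range_b = value_range(exam_a), value_range(exam_b)
spread_a = average_distance_from_mean(exam_a)
spread_b = average_distance_from_mean(exam_b)
print(range_a, range_b, round(spread_a, 2), round(spread_b, 2))
```

Both series have the same range, yet the second yardstick differs between them, which is why the range alone can mislead.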
Mean Deviation 

The range is dependent for its value entirely upon the two 
extremes. Obviously, where there are extremes a satisfactory 
measure of dispersion must be dependent upon the position of 
every value in the series. 

A simple method for determining the scatter of a series of 
values about a given point (e.g., the scatter of hits on a target) 
would be to take the average distance of the items from the given 
point (with the target this point would be the “bull's eye”). 
The smaller the average distance about this point the smaller the 
scatter or dispersion of the values. In a frequency distribution 
the average distance of the items from the measure of central 
tendency, such as the arithmetic mean, may be used for this purpose. 

Since, however, the deviations about the arithmetic mean sum 
to zero, in order to obtain the average value it is 
necessary to ignore signs. 


Table 8 — Sales Record of Clerk No. 148 in a New York City Department Store 
for the Month of June 1932 

                                    Deviation from 
               Number of Sales      Monthly Average 
Date                (X)                  (x) 
June  1              15                 11.85 
      2              27                   .15 
      3              31                  4.15 
      4              27                   .15 
      6              23                  3.85 
      7              23                  3.85 
      8              25                  1.85 
      9              31                  4.15 
     10              29                  2.15 
     11              41                 14.15 
     13              17                  9.85 
     14              30                  3.15 
     15              45                 18.15 
     16              24                  2.85 
     17              26                   .85 
     18              26                   .85 
     20              23                  3.85 
     21              15                 11.85 
     22              37                 10.15 
     23              27                   .15 
     24              18                  8.85 
     25              39                 12.15 
     27              19                  7.85 
     28              18                  8.85 
     29              21                  5.85 
     30              41                 14.15 

Total               698                165.70 

Average              26.85               6.37 


This measure of dispersion is called the mean deviation. It 
consists of the average of the deviations of the items from their 
arithmetic mean or median. 
Characteristics 

1. The value of the mean deviation is dependent upon the value 
of every item in the series. 

2. It may be computed about either the arithmetic mean or the 
median. 

3. The average deviation from the median is a minimum. 

Computation of Mean Deviation — Ungrouped Data 

Formula: 

\[ MD = \frac{\sum |x|}{N} \quad \text{or} \quad MD = \frac{\sum |d|}{N} \]

Where 

MD is the mean deviation 
\(\sum |x|\) = the sum of the deviations of each value from the 
arithmetic mean, signs ignored. 

\(\sum |d|\) = the sum of the deviations from another measure of 
central tendency such as the median, signs ignored. 

The mean deviation may be computed about either the mean 
or the median. When computed about the median it will be no 
larger than when computed about the mean or about any other value, 
since the sum of the deviations (signs ignored) about the median 
is a minimum. 
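
As an illustration of the ungrouped formula, the following Python sketch recomputes the mean deviation of the daily sales of Clerk No. 148 (Table 8). The function name `mean_deviation` is introduced for this sketch only.

```python
from statistics import mean, median

# Number of sales per working day, June 1932 (Table 8).
sales = [15, 27, 31, 27, 23, 23, 25, 31, 29, 41, 17, 30, 45,
         24, 26, 26, 23, 15, 37, 27, 18, 39, 19, 18, 21, 41]

def mean_deviation(values, center):
    """Average of the deviations from `center`, signs ignored."""
    return sum(abs(v - center) for v in values) / len(values)

avg = mean(sales)
md_about_mean = mean_deviation(sales, avg)
md_about_median = mean_deviation(sales, median(sales))
print(round(avg, 2), round(md_about_mean, 2), round(md_about_median, 2))
```

The value about the mean agrees with the 6.37 shown in the table, and the value about the median is not larger, as the text states.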



Computation — Grouped Data 

When the data is grouped in the form of a frequency distribu- 
tion the value of the mean deviation may be determined as follows: 

1. Obtain the deviation of the midpoint of each class interval 
from the median (or mean). 

2. Multiply the deviations by the number of items (the 
frequency) in each class interval. 

3. Divide the total of the values obtained by the number of 
cases. 

A simpler method (arithmetically) is to: 

1. Select an arbitrary origin. 

2. Obtain the deviations of the midpoints about this more 
convenient value. The midpoint of the group in which the 
median (or mean) is located is selected as an origin. The 
deviation of the midpoint of each class interval from this 
origin is then determined (see d' in table 9, column 4) and 
multiplied by the frequency of each group (column 3). 

Table 9 — Computation of Mean Deviation 
Ages of Pupils in the First Half of First Grade in a New York City Public School 

    (1)            (2)           (3)              (4)                  (5) 
  Age in                                      Deviation from 
  Months                      Number of      Arbitrary Origin      Frequency X 
  Class          Midpoint       Pupils            (d')              Deviation 
 Intervals       (M.P.)          (f)       (In Class Intervals)       (fd') 
  68-69.9          69             12               2                    24 
  70-71.9          71             33               1                    33 
  72-73.9          73             57               0                     0 
  74-75.9          75             25               1                    25 
  76-77.9          77              9               2                    18 
  78-79.9          79              4               3                    12 
  80-81.9          81              6               4                    24 
  82-83.9          83              2               5                    10 
  84-85.9          85              0               6                     0 
  86-87.9          87              0               7                     0 
  88-89.9          89              2               8                    16 

                                 150                                   162 

In the illustrated distribution the median (73.052 months) 
is located above the midpoint of the class interval used as the 
arbitrary origin (73); therefore all of the deviations at that point 
or below are too small by the amount of the difference between the 
two values, .052 months in actual units or .026 class intervals. 
There are 102 (12 + 33 + 57) of these values and therefore their 
total understatement is 102 times the difference of .026 class 
intervals. 

The values above the arbitrary origin in similar fashion are 
overstated by the amount of the difference times the number of the 
values, or .026 class intervals times 48. 

The understatement of the 102 values below the arbitrary 
origin is in part offset by the 48 items overstated, leaving only 
54 values understated. The average understatement will then be 
54 times .026 divided by the total number of items. If this 
correction is added to the average deviation about the arbitrary 
origin the result will be the mean deviation in class intervals. 
Formula¹: 

\[ MD' = \frac{\sum |fd'|}{N} + \frac{(N_S - N_L)\,c}{N} \]

\[ MD' = \frac{162}{150} + \frac{(102 - 48)(.026)}{150} = 1.0894 \]


¹ (a) The equation may be simplified to read 

\[ MD' = \frac{\sum (fd') + (N_S - N_L)\,c}{N} \]

(b) This calculation assumes all values to be located at the midpoint in the median or 
mean group. A more exact value may be obtained by assuming an even distribution of these 
values throughout that group. The following formula (see Handbook of Mathematical Statistics, 
H. L. Rietz, Editor, pp. 29-31) may be used: 

\[ MD' = \frac{\sum fd'}{N} + \frac{(N_b - N_a)\,c + f_c\,(.25 + c^2)}{N} \]

Where \(N_a\) = number of items above median (or mean) group 
\(N_b\) = number of items below median (or mean) group 
\(f_c\) = number (frequency) of cases in median (or mean) group 
\(c\) = difference between arbitrary origin and median (or mean) 
MD in actual values is obtained by multiplying MD' by C. 
Where 

MD' = Mean deviation in class intervals 
\(N_S\) = Number of cases too small or understated 
\(N_L\) = Number of cases too large or overstated 
\(c\) = Difference between arbitrary origin (midpoint of 
median or mean group) and median or mean 

The mean deviation in actual units will be obtained if the result, 
which is expressed in terms of class intervals, is multiplied by the 
size of the class interval. 

\[ MD = MD' \times C \]

where 

C = size of class interval 

\[ MD = 1.0894 \times 2 = 2.1788 \text{ months} \]
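
The arbitrary-origin computation for grouped data can be followed step by step in the Python sketch below. It uses the midpoints and frequencies of Table 9 and takes the median of 73.052 months as given in the text; the variable names are assumptions of this sketch.

```python
# (midpoint, frequency) for each class interval of Table 9.
classes = [(69, 12), (71, 33), (73, 57), (75, 25), (77, 9), (79, 4),
           (81, 6), (83, 2), (85, 0), (87, 0), (89, 2)]

C = 2            # size of class interval, in months
origin = 73      # arbitrary origin: midpoint of the median group
median = 73.052  # median as stated in the text

N = sum(f for _, f in classes)

# Sum of frequency times deviation from the origin, in class intervals,
# signs ignored.
sum_fd = sum(f * abs(m - origin) / C for m, f in classes)

# Difference between the median and the origin, in class intervals.
c = (median - origin) / C

# Cases at or below the origin are understated, cases above are overstated.
n_small = sum(f for m, f in classes if m <= origin)
n_large = sum(f for m, f in classes if m > origin)

md_prime = sum_fd / N + (n_small - n_large) * c / N   # in class intervals
md = md_prime * C                                      # in months
print(N, sum_fd, n_small, n_large, round(md_prime, 4), round(md, 4))
```

The small difference in the last decimal of the result in months arises because the text rounds MD' to 1.0894 before multiplying by the class interval.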

The Standard Deviation 

The standard deviation is a special form of average deviation 
from the mean. It is computed by taking the quadratic mean 
(see page 27) of the deviations from the arithmetic mean of 
these values. The standard deviation is thus the root-mean- 
square of the deviations from the arithmetic mean. 

Formula: 

\[ \sigma = \sqrt{\frac{\sum x^2}{N}} \]

where 

\(\sigma\) = standard deviation¹ 

\(x\) = deviations from arithmetic mean 
\(N\) = total number of items (\(\sum f\)) 


Computation — Ungrouped Data² 

1. Get the difference between each actual value and the 
arithmetic mean. 

2. Square the values thus obtained. Obtain the average of the 
squares. 

3. Take the square root of the resulting average. 


Computation — Grouped Data 

Where there are a considerable number of items in the series the 
calculation of the standard deviation can be more readily per- 
formed if the data are first grouped into the form of a frequency 
distribution. 


1. The deviation of the midpoint of each group from the 
arithmetic mean is used as a measure of the average deviation 
from the mean of all items in that group. 


¹ The symbol σ is the Greek small letter sigma. 

² A more convenient formula for ungrouped data may be derived from an algebraic 
manipulation of the standard deviation formula. See technical Appendix III. 

\[ \sigma = \sqrt{\frac{\sum (X^2)}{N} - \left(\frac{\sum X}{N}\right)^2} \]
Table 10 — Computation of Standard Deviation — Ungrouped Data 
Bid Prices for 12 Joint Stock Land Bank Bonds on April 18, 1934 

                                               Deviation from 
                      Rate       Bid Price      Mean (70.5) 
Bank                (Percent)       (X)             (x)            x² 
Atlanta                 5            71              .5            .25 
Burlington              5            65           - 5.5          30.25 
Chicago                 5            41           -29.5         870.25 
Dallas                  5            80             9.5          90.25 
Denver                  5            73             2.5           6.25 
Des Moines              5            78             7.5          56.25 
Fort Wayne              5            71              .5            .25 
First Carolinas         5            69           - 1.5           2.25 
First Texas             5            71              .5            .25 
Lincoln                 5            79             8.5          72.25 
Louisville              5            75             4.5          20.25 
New York                6            73             2.5           6.25 

Total                               846             0          1155.00 

Mean                                 70.5                        96.25 

Source: Wall Street Journal. 

\[ \sigma = \sqrt{\frac{\sum x^2}{N}} = \sqrt{\frac{1155}{12}} = \sqrt{96.25} = 9.81 \]
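
The three computation steps for ungrouped data can be traced with the bid prices of Table 10. The Python sketch below is illustrative; it also evaluates the alternative formula based on the squares of the original values, which should give the same result.

```python
import math

# Bid prices of the 12 bonds in Table 10.
bid_prices = [71, 65, 41, 80, 73, 78, 71, 69, 71, 79, 75, 73]

N = len(bid_prices)
mean = sum(bid_prices) / N

# Steps 1-3: deviations from the mean, average of their squares, square root.
sum_x2 = sum((x - mean) ** 2 for x in bid_prices)
sigma = math.sqrt(sum_x2 / N)

# Alternative form: mean of the squared values less the square of the mean.
sigma_alt = math.sqrt(sum(x * x for x in bid_prices) / N - mean ** 2)

print(mean, sum_x2, round(sigma, 2), round(sigma_alt, 2))
```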


Table 11 — Computation of Standard Deviation — Grouped Data — Long Method 
Percent of Tax Delinquency in 151 Cities of Over 50,000 Population in the 
United States, 1933 

     (1)             (2)           (3)            (4)            (5)            (6) 
Percent of Tax                  Number of      Deviation 
 Delinquency                     Cities        from Mean 
   (Class         Midpoint     (Frequency)      (26.74) 
  Interval)        (M.P.)          (f)            (x)            (x²)          f(x²) 
  0- 4.99           2.50            1           -24.24         587.5776       587.5776 
  5- 9.99           7.50           12           -19.24         370.1776      4442.1312 
 10-14.99          12.50           19           -14.24         202.7776      3852.7744 
 15-19.99          17.50           24           - 9.24          85.3776      2049.0624 
 20-24.99          22.50           19           - 4.24          17.9776       341.5744 
 25-29.99          27.50           19              .76            .5776        10.9744 
 30-34.99          32.50           16             5.76          33.1776       530.8416 
 35-39.99          37.50           15            10.76         115.7776      1736.6640 
 40-44.99          42.50           12            15.76         248.3776      2980.5312 
 45-49.99          47.50            8            20.76         430.9776      3447.8208 
 50-54.99          52.50            2            25.76         663.5776      1327.1552 
 55-59.99          57.50            0            30.76         946.1776          0 
 60-64.99          62.50            2            35.76        1278.7776      2557.5552 
 65-69.99          67.50            2            40.76        1661.3776      3322.7552 

 Total                            151                                       27187.4176 

Source: Dun & Bradstreet's Municipal Review. 

\[ \sigma = \sqrt{\frac{\sum f(x^2)}{N}} = \sqrt{\frac{27{,}187.4176}{151}} = 13.42\% \]
2. The average deviation of each group is squared to obtain the 
necessary deviation squared. 

3. The squared deviation is multiplied by the frequency 
indicated for the group in order to obtain the total of the squared 
deviations for that group. 

4. The totals are then added for the entire distribution. 

5. The square root of the sum obtained after dividing by N is 
the standard deviation. 
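
The five steps of the long method can be sketched in Python using the midpoints and frequencies of Table 11. The sketch is illustrative; it works with the unrounded mean rather than the 26.74 shown in the table heading, which does not change the result to two decimals.

```python
import math

# (midpoint, frequency) for each class interval of Table 11.
classes = [(2.5, 1), (7.5, 12), (12.5, 19), (17.5, 24), (22.5, 19),
           (27.5, 19), (32.5, 16), (37.5, 15), (42.5, 12), (47.5, 8),
           (52.5, 2), (57.5, 0), (62.5, 2), (67.5, 2)]

N = sum(f for _, f in classes)
mean = sum(m * f for m, f in classes) / N

# Steps 1-4: deviation of each midpoint from the mean, squared,
# weighted by the class frequency, and totalled.
sum_fx2 = sum(f * (m - mean) ** 2 for m, f in classes)

# Step 5: divide by N and take the square root.
sigma = math.sqrt(sum_fx2 / N)
print(N, round(mean, 2), round(sigma, 2))
```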



Short Method 

The computation of the standard deviation may be simplified 
by the following method. 

1. Instead of using the arithmetic mean, compute the standard 
deviation about a conveniently selected point. For 
this purpose the midpoint of any group may be selected. 
Since the quadratic mean of the deviations about the 
arithmetic mean is smaller than the quadratic mean of 
deviations about any other point, the resulting value will 
be larger than the true standard deviation. 

2. Subtract from it a correction value to obtain the necessary 
result. The value of the correction factor may be 
determined by an algebraic manipulation of the formula 
(see technical Appendix II). 


The resulting formula is 

\[ \sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \]

where \(\sigma\) = standard deviation 

\(d\) = deviation of midpoint of each class interval from that 
of arbitrary group. 

Where the class intervals are uniform in size the calculation 
may be further simplified by carrying out all computations in 
terms of class intervals and then multiplying the final results by 
the size of the class interval. 

The formula will then read 

\[ \sigma = C \sqrt{\frac{\sum fd'^2}{N} - \left(\frac{\sum fd'}{N}\right)^2} \]

where \(\sigma\) = standard deviation 

\(d'\) = deviation of midpoint of class interval from arbitrary 
origin in terms of class intervals. 

\(f\) = frequency of values in class interval. 

\(C\) = size of class interval. 
The application of this formula to the computation of the standard 
deviation of the minimum line rates for national advertising 
for 249 daily newspapers is shown in table 12. 


Table 12 — Computation of Standard Deviation — Short-Group Deviation Method 
Minimum Line Rate for National Advertising for 249 Daily Newspapers in 
Cities of 25,000 to 50,000 in the United States, 1933 

     (1)              (2)               (3) 
                                   Deviation from 
                   Number of      Arbitrary Origin 
Rate Per Line     Newspapers     in Class Intervals 
(In Dollars)         (f)               (d')              fd'          fd'² 
 $.01-.019             2               - 5              - 10           50 
  .02-.029             4               - 4              - 16           64 
  .03-.039            23               - 3              - 69          207 
  .04-.049            30               - 2              - 60          120 
  .05-.059            40               - 1              - 40           40 
  .06-.069            45                 0                 0            0 
  .07-.079            35                 1                35           35 
  .08-.089            25                 2                50          100 
  .09-.099            12                 3                36          108 
  .10-.109             9                 4                36          144 
  .11-.119             6                 5                30          150 
  .12-.129            10                 6                60          360 
  .13-.139             3                 7                21          147 
  .14-.149             1                 8                 8           64 
  .15-.159             1                 9                 9           81 
  .16-.169             3                10                30          300 

                     249                                 120         1970 

Source: Editor and Publisher, International Year Book for 1935. 

\[ \sigma = \$.01 \sqrt{\frac{1970}{249} - \left(\frac{120}{249}\right)^2} = \$.01 \sqrt{7.6794} = (2.7711)(\$.01) = \$.0277 \]
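
The short method in class-interval units can be reproduced with the frequencies of Table 12. The Python sketch below is an illustration under the table's own assumptions (uniform class intervals of $.01 and the .06-.069 class as arbitrary origin); the variable names are introduced here.

```python
import math

# Frequencies of Table 12, from the $.01-.019 class to the $.16-.169 class.
freqs = [2, 4, 23, 30, 40, 45, 35, 25, 12, 9, 6, 10, 3, 1, 1, 3]
# Deviations from the arbitrary origin in class intervals: -5, -4, ..., 10.
d_primes = list(range(-5, 11))
C = 0.01   # size of class interval, in dollars

N = sum(freqs)
sum_fd = sum(f * d for f, d in zip(freqs, d_primes))
sum_fd2 = sum(f * d * d for f, d in zip(freqs, d_primes))

# Quadratic mean about the origin, less the correction for using an
# origin other than the arithmetic mean.
sigma_prime = math.sqrt(sum_fd2 / N - (sum_fd / N) ** 2)   # class intervals
sigma = C * sigma_prime                                     # dollars
print(N, sum_fd, sum_fd2, round(sigma, 4))
```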


Correction for Grouping 

The value computed by the grouping method will be subject 
to the assumption that all the values are located at the midpoint 
of each class interval and in part is dependent on the size of the 
class interval used. The error in the assumption is constant when: 

1. The distribution is “continuous” (see page 7). 

2. The distribution tapers off gradually in both directions. 

Under these conditions the true standard deviation is 

\[ \sigma = C \sqrt{\sigma'^2 - \frac{1}{12}} \]

where: 

\(\sigma'\) = standard deviation in class interval units 
\(C\) = size of class interval 
Check on Computation — Charlier Check 

A simple check may be used in order to determine the accuracy 
of the computations preliminary to the actual substitution in the 
formula for the standard deviation. 

If an additional value \(\sum f(d' + 1)^2\) is computed it will be seen 
that 

\[ \sum f(d' + 1)^2 = \sum f(d'^2 + 2d' + 1) = \sum fd'^2 + 2\sum (fd') + \sum (f) \]

But since 

\[ \sum f = N \]

\[ \therefore \sum f(d' + 1)^2 = \sum fd'^2 + 2\sum (fd') + N \]

These are the values used in the computation of the standard 
deviation. If the equation is fulfilled the preliminary values are 
correct. If the computation check is applied to the problem 
shown above the result is: 


Table 12a — Charlier Check — Applied to Data of Table 12 

     (1)              (2)            (3)          (4)          (5)            (6) 
                   Number of        (From 
Rate Per Line     Newspapers      Table 12) 
(In Dollars)         (f)             d'         d' + 1      (d' + 1)²      f(d' + 1)² 
 $.01-.019             2            - 5          - 4           16              32 
  .02-.029             4            - 4          - 3            9              36 
  .03-.039            23            - 3          - 2            4              92 
  .04-.049            30            - 2          - 1            1              30 
  .05-.059            40            - 1            0            0               0 
  .06-.069            45              0            1            1              45 
  .07-.079            35              1            2            4             140 
  .08-.089            25              2            3            9             225 
  .09-.099            12              3            4           16             192 
  .10-.109             9              4            5           25             225 
  .11-.119             6              5            6           36             216 
  .12-.129            10              6            7           49             490 
  .13-.139             3              7            8           64             192 
  .14-.149             1              8            9           81              81 
  .15-.159             1              9           10          100             100 
  .16-.169             3             10           11          121             363 

                     249                                                     2459 

\[ \sum f(d' + 1)^2 = 2459 \]

\[ \sum fd'^2 + 2\sum fd' + N = 1970 + 2(120) + 249 = 2459 \]
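
The Charlier check lends itself to a direct verification. The following Python sketch recomputes both sides of the identity from the frequencies and class-interval deviations of Table 12; it is a sketch for illustration, with variable names chosen here.

```python
# Frequencies and deviations (in class intervals) from Table 12.
freqs = [2, 4, 23, 30, 40, 45, 35, 25, 12, 9, 6, 10, 3, 1, 1, 3]
d_primes = list(range(-5, 11))   # -5, -4, ..., 10

N = sum(freqs)
sum_fd = sum(f * d for f, d in zip(freqs, d_primes))
sum_fd2 = sum(f * d * d for f, d in zip(freqs, d_primes))

# Left side: the additional column of Table 12a.
check_total = sum(f * (d + 1) ** 2 for f, d in zip(freqs, d_primes))

# Right side: built from the totals used in the standard deviation formula.
expected = sum_fd2 + 2 * sum_fd + N
print(check_total, expected)
```

If a slip had been made in any of the column totals, the two figures would no longer agree.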


Characteristics 

1. The standard deviation is affected by the value of every item. 

2. Greater emphasis is placed on extremes than in the mean 
deviation, because all values are squared in the computation 
of the standard deviation. 

3. In a normal or bell shaped distribution the mean deviation 
is .7979 σ. In a moderately skewed distribution this relationship 
is approximately true. 

4. (a) If a distance equal to one standard deviation is measured 
off on the X axis on both sides of the arithmetic 
mean in a normal distribution, 68.26% of the values will 
be included within the limits indicated. 

(b) If two standard deviations are measured off 95.46% of 
the items will be included. 

(c) Three standard deviations measured off will include 
99.73% of the cases. 

The above percentages are exact only where the distribution is 
perfectly normal. In the case of a moderately skewed distribution 
the percentages are approximations. As such they are indicated 
more generally as about 68% for one standard deviation on either 
side of the mean, 95% for two, and practically all of the values 
(99.7%) for three. 

The exact per cent of cases included for any number of standard 
deviations measured from the arithmetic mean in one direction 
only may be found in table 31, page 110. 
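
The normal-curve figures quoted above can be checked numerically. The Python sketch below relies on the standard result that the share of a normal distribution within k standard deviations of the mean is erf(k/√2), and that the ratio of mean deviation to standard deviation is √(2/π); both facts are brought in for this illustration and are not derived in the text. The computed percentages may differ from the quoted ones by one unit in the last decimal because of rounding in printed tables.

```python
import math

def share_within(k):
    """Per cent of a normal distribution lying within k standard
    deviations on either side of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

within_1 = share_within(1)
within_2 = share_within(2)
within_3 = share_within(3)

# Ratio of the mean deviation to the standard deviation in a normal curve.
md_ratio = math.sqrt(2 / math.pi)

print(round(within_1, 2), round(within_2, 2), round(within_3, 2),
      round(md_ratio, 4))
```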



Fig. 12 — “Normal” Distribution Showing the Percentage of the Area 
Included Within One Standard Deviation Measured Both Plus and 
Minus About the Arithmetic Mean. 
5. Since 3 standard deviations on either side of the mean include 
practically all (99.7%) of the cases in moderately skewed or normal 
distributions, the standard deviation is about 1/6 of the range. 


The Quartile Deviation or Semi-Interquartile Range 


As the dispersion of a frequency distribution is increased the 
distance between the quartiles is enlarged. Since an increased 
dispersion will be indicated by a greater distance between the 
quartiles, this distance may be used as the basis of a measure of 
dispersion. 

If the distribution were perfectly symmetrical the two quartiles 
would be equi-distant from the median. One half of the distance 
between the quartiles would represent the distance between the 
quartiles and the median. 

One half of the distance between the quartiles may be used as 
a measure of the average distance of each quartile from the median. 
This value is used as a measure of dispersion. 


\[ QD = \frac{Q_3 - Q_1}{2} \]

where: 

QD = Quartile Deviation 
\(Q_3\) = Third Quartile 
\(Q_1\) = First Quartile 

If a distance equal to one QD is measured off on either side of a 
point half way between the quartiles, 50% of the values will be 
included between these limits. This midpoint value is assigned 
the letter K. K coincides with the median only in a symmetrical 
distribution. 


The 10-90 Percentile Range 

In a similar manner as the dispersion of the distribution is 
increased, the distance between the various percentiles will be 
increased. The distance between any two percentiles may thus be 
used as a measure of dispersion. The spread between the tenth 
percentile and ninetieth percentile is generally used for this 
purpose. This measure is known as the 10-90 percentile range. 

\[ \text{10-90 Percentile Range} = P_{90} - P_{10} \]

This measure of dispersion has the advantage of being dependent 
upon a larger percent (80%) of the cases than the quartile deviation 
(50%) while excluding the unusual cases represented by the 
extreme 10% on either end of the distribution. 
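
As a rough illustration of the two positional measures, the Python sketch below applies them to the daily sales of Table 8. It assumes the interpolation rule used by `statistics.quantiles` in the Python standard library; other conventions for locating quartiles and percentiles exist and may give slightly different values, so the figures are indicative only.

```python
from statistics import quantiles

# Daily sales from Table 8.
sales = [15, 27, 31, 27, 23, 23, 25, 31, 29, 41, 17, 30, 45,
         24, 26, 26, 23, 15, 37, 27, 18, 39, 19, 18, 21, 41]

# Quartiles: three cut points dividing the data into four parts.
q1, q2, q3 = quantiles(sales, n=4)
qd = (q3 - q1) / 2                      # quartile deviation

# Deciles: nine cut points; the first is P10 and the last is P90.
deciles = quantiles(sales, n=10)
p10, p90 = deciles[0], deciles[-1]
percentile_range = p90 - p10            # 10-90 percentile range

print(q1, q3, qd, p10, p90, percentile_range)
```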

Relative Measures of Dispersion 

The measures outlined above are absolute measures of dispersion 
and therefore the resulting values cannot always be compared 
with significance. The standard deviation in months of the ages of 
students in a given grade of a New York City school cannot be 
compared to the dispersion of the Intelligence Quotients in the 
same class where there is a standard deviation in percent because 
of the difference in units. 

In addition a measure of dispersion must be compared to the 
size of the average about which it is measured. For instance, 
a variation of five dollars in the price of a share of stock, the 
average price of which is ten dollars, does not have the same value in 
relation to its average price as a share which has the same 
variation but an average price of one hundred dollars. 

To relate the measure of dispersion to its average and to 
convert it to percentage form, the standard deviation is divided 
by the arithmetic mean. Stating this measure in percentage form 
solves the problem presented by the differing units. The resulting 
measure developed by Pearson is known as the coefficient of 
variation (V). 

\[ V = 100 \, \frac{\sigma}{\bar{X}} \]

Other comparative coefficients of dispersion may be computed 
when using the other measures of dispersion. 

\[ V_{AD} = \frac{AD}{\text{Median (or mean)}} \]

\[ V_Q = \frac{\dfrac{Q_3 - Q_1}{2}}{\dfrac{Q_3 + Q_1}{2}} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
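
A brief Python sketch of the coefficient of variation follows, using the bid prices of Table 10 as sample data. The function name is an assumption of the sketch, and the standard deviation is computed with N in the divisor, as in the formulas of this chapter.

```python
import math

def coefficient_of_variation(values):
    """Standard deviation as a per cent of the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return 100 * sigma / mean

# Bid prices from Table 10 (mean 70.5, standard deviation 9.81).
bid_prices = [71, 65, 41, 80, 73, 78, 71, 69, 71, 79, 75, 73]
v = coefficient_of_variation(bid_prices)
print(round(v, 1))
```

Because the result is a pure percentage, it can be set beside the coefficient of a series measured in quite different units.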

Skewness 

Skewness is a term for the degree of distortion from symmetry 
exhibited by a frequency distribution. 

When a distribution is perfectly symmetrical, the values of 
the mean, median and mode coincide. In an asymmetrical 
(skewed) distribution the values of the averages will depart from 
one another. Since the arithmetic mean is most affected by 
extremes it will move the greatest distance from the mode. The 
mode is not affected at all by unusual values; therefore the greater 
the degree of skewness the greater the distance between the mean 
and the mode. 

It follows that this distance between mean and mode may 
be used as a measure of skewness, since the greater the lack of 
symmetry the larger the discrepancy between them. However, 
because the measure of skewness is used largely for comparative 
purposes, the problem of differing units will again make its appear- 
ance. A second difficulty arises in that the distance between the 
averages will be larger in a widely dispersed distribution than in 
one with a narrow dispersion. Both difficulties may be removed 
by dividing the measure by the measure of dispersion, 
\[ Sk = \frac{\text{Mean} - \text{Mode}}{\sigma} \]

Since the distance between the mean and mode in moderately 
skewed distributions is approximately three times the distance 
between the mean and the median (see page 24), for this type of 
distribution the formula may be rewritten as follows: 

\[ Sk = \frac{3\,(\text{Mean} - \text{Median})}{\sigma} \]


When the distribution is symmetrical the values of the mode 
and the mean will coincide. Under these circumstances the 
coefficient of skewness will be zero. 

Where the curve is right skewed the extremely large values will 
increase the value of the mean. This results in increasing the 
value of mean over that of the mode, since the latter remains 
unaffected by the extremes. The coefficient will then be a positive 
value. If the distribution is skewed to the left, the extreme cases 
will reduce the value of the mean. This makes it smaller than the 
mode and results in a negative coefficient of skewness. 

Another measure is based on the position of the quartiles. 
In a symmetrical distribution the quartiles are equidistant from 
the median. In a skewed distribution the quartiles will differ 
in their distances from the median. 


The greater the lack of symmetry the larger the discrepancy 
between the two distances of the quartiles from the median. If 
this is divided by the quartile deviation (the measure of dispersion 
based on the quartiles) the result is a coefficient of skewness. 


Formula: 

\[ Sk = \frac{(Q_3 - \text{Median}) - (\text{Median} - Q_1)}{QD} \]

Since the distances will be equal for a symmetrical distribution, 
in this case Sk will equal zero. Where the distribution is right 
skewed the right quartile (\(Q_3\)) will be a greater distance from the 
median than \(Q_1\). The opposite will be true where the curve is 
left skewed. The resulting coefficient will then be negative. 
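
Both coefficients of skewness can be sketched in Python. The first function applies the mean-median form to the Examination A grades used earlier in the chapter; the second takes the quartiles and median as inputs and is shown with hypothetical values chosen only to illustrate the sign of the result. Function names are assumptions of this sketch.

```python
import math
from statistics import mean, median

def skewness_mean_median(values):
    """Sk = 3 (Mean - Median) / standard deviation."""
    m = mean(values)
    sigma = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return 3 * (m - median(values)) / sigma

def skewness_quartile(q1, med, q3):
    """Sk = [(Q3 - Median) - (Median - Q1)] / QD."""
    qd = (q3 - q1) / 2
    return ((q3 - med) - (med - q1)) / qd

# Examination A: one grade of 90 far above the rest pulls the mean upward.
exam_a = [60, 60, 61, 63, 65, 65, 66, 67, 68, 90]
sk_a = skewness_mean_median(exam_a)

# Hypothetical quartiles: Q3 lies farther from the median than Q1.
sk_right = skewness_quartile(10, 20, 40)
# Hypothetical symmetrical case: quartiles equidistant from the median.
sk_symmetric = skewness_quartile(10, 20, 30)

print(round(sk_a, 3), round(sk_right, 3), sk_symmetric)
```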


Kurtosis 

The “peakedness” of the frequency distribution is another 
characteristic which might be measured. The measurement of this 
characteristic is outlined in Chapter XV. 
CHAPTER V 


TIME SERIES ANALYSIS — TREND 


Definition 

The time series is an arrangement of statistical data in 
accordance with its time of occurrence. 

Classification of Movements 

The analysis of the time series consists of the description and 
measurement of the various changes or movements as they 
appear in the series during a period of time. These changes or 
movements may be classified as: 

1. Secular trend, or the long time growth or decline occurring 
within the data. The period covered should include not less 
than ten years. 

2. Seasonal variation, or the more or less regular movement 
within the twelve month period. This movement occurs 
year after year and is caused by the changing seasons. 

3. Cyclical movement, or the swing from prosperity through 
recession, depression, recovery, and back again to prosperity. 
This movement varies in time, length and intensity. 

4. Residual, accidental or random variations, including such 
unusual disturbances as wars, disasters, strikes, fads, or 
other non-recurring factors. 

Measurement of Trend 

Four methods are commonly used for measuring trends: 

1. Freehand. 

2. Semi-average. 

3. Moving average. 

4. “Least squares” (see Chapter VI). 

Methods 

1. Freehand Method. To fit a trend by the freehand method 
draw a line through a graph of the data in such a way as to 
describe what appears to the eye to be the long period movement. 
A line of trend fitted by this method is shown in figure 13. The 
drawing of this line need not be strictly freehand but may be 
accomplished with the aid of a transparent straight edge or a 
“French” curve. 
Source: U. S. Geological Survey. 


Fig. 13— Average Daily Output of Electric Power in the United States, 
1919 — 1932. Trend indicated by a freehand line. 



Source: United States Department of Commerce. 

Fig. 14 — Consumption of Cigarettes in the United States, 1914 — 1931. 
Trend indicated by semi-average method. 
Advantages 

1. The method is simple. 

2. The method may be used in place of a mathematical equation 
which may not logically describe the trend. 

3. If drawn with care the trend line fitted by this method will 
be a close approximation to a mathematically fitted trend. 

Disadvantages 

1. The results vary according to personal estimate. 

2. Considerable practice is required to make a good fit. 

2. Semi-average Method. In this procedure the data are split 
into two equal parts and the figures in each half are averaged. 
The averages thus obtained are plotted at the center of their 
respective periods (see figure 14) and a straight line is then drawn 
through the two points. 

In the example shown below (table 13) the data are split into two 
parts (1914-1922 and 1923-1931). The two values obtained by 
averaging the figures in each half of the data (38.25 for the first 
half and 95.45 for the second half) are plotted at the midpoints of 
their respective periods (the middle of 1918 for the first group and 
the middle of 1927 for the second). With the aid of a ruler a 
straight line is then drawn through the two points. 

Table 13 — Computation of Trend — Semi-Average Method 
Consumption of Cigarettes in the United States, 1914-1931 

            Consumption 
           (Billions of                        Arithmetic 
Year        Cigarettes)        Totals            Means 
1914           16.86 
1915           17.96 
1916           25.29 
1917           35.33 
1918           46.66        344.28 ÷ 9 =         38.25 
1919           53.12 
1920           44.62 
1921           50.87 
1922           53.57 
1923           64.45 
1924           70.01 
1925           79.96 
1926           89.45 
1927           97.18        859.08 ÷ 9 =         95.45 
1928          105.92 
1929          119.04 
1930          119.62 
1931          113.45 
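
The semi-average computation of Table 13 can be sketched in Python as follows. The annual increment of the straight line drawn through the two plotted points is added here as a convenience and is not a figure given in the text; variable names are likewise assumptions of the sketch.

```python
# Consumption of cigarettes, billions, 1914-1931 (Table 13).
years = list(range(1914, 1932))
consumption = [16.86, 17.96, 25.29, 35.33, 46.66, 53.12, 44.62, 50.87, 53.57,
               64.45, 70.01, 79.96, 89.45, 97.18, 105.92, 119.04, 119.62, 113.45]

half = len(consumption) // 2
first_half, second_half = consumption[:half], consumption[half:]

# Average of each half, plotted at the middle year of its period.
first_mean = sum(first_half) / len(first_half)       # plotted at 1918
second_mean = sum(second_half) / len(second_half)    # plotted at 1927
first_mid = years[half // 2]
second_mid = years[half + half // 2]

# Annual increment of the straight line through the two points.
annual_increment = (second_mean - first_mean) / (second_mid - first_mid)

print(first_mid, round(first_mean, 2), second_mid, round(second_mean, 2),
      round(annual_increment, 2))
```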




Advantages 

1. The method is simple. 

2. The result is entirely objective, i.e., it is not dependent upon 
individual estimate. 





Disadvantages 

1. The semi-average method makes use of the arithmetic mean, 
which as we have seen before is greatly affected by extreme 
values. For this reason the semi-average trend line may be 
pulled out of its true position by such unusual occurrences 
as strikes, etc. 

2. The method is used primarily for the fitting of straight line 
trends. 

3. Moving Average Method. In the moving average method 
the trend is described by smoothing out the fluctuations of the 
data by means of a moving average. 

The moving average is a series of successive averages secured 
from a series of items by dropping the first item in each group 
averaged and including the next in the series, thus obtaining the 
next average. To obtain a three item moving average, in the 
illustration below, the first three numbers (3, 5 and 7) are added 
(the total is entered in column 2 next to the middle item of the 
group). The first number (3) is then replaced by the next number 
in the column of figures (in this case 10) and the process is continued 
until the entire series has been included. Each total is then 
divided by three and the result is placed in column 3. 


  (1)            (2)                (3) 
               3 Item             3 Item 
Values      Moving Total      Moving Average 
   3 
   5             15                5.00 
   7             22                7.33 
  10             29                9.67 
  12             36               12.00 
  14             41               13.67 
  15             46               15.33 
  17 
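
The three item moving average of the illustration can be reproduced with a short Python sketch. The helper names `moving_totals` and `moving_averages` are introduced here for illustration.

```python
values = [3, 5, 7, 10, 12, 14, 15, 17]

def moving_totals(items, k):
    """Successive totals of k items, dropping the first item of each
    group and including the next item in the series."""
    return [sum(items[i:i + k]) for i in range(len(items) - k + 1)]

def moving_averages(items, k):
    """Each moving total divided by the number of items in the group."""
    return [total / k for total in moving_totals(items, k)]

totals = moving_totals(values, 3)
averages = [round(a, 2) for a in moving_averages(values, 3)]
print(totals)
print(averages)
```

Each total or average belongs beside the middle item of its group, so the first and last values of the series have no entry.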




The fluctuations caused by the business cycle in an economic 
time series may be removed or partially eliminated by including 
in the moving average a number of items (years) equal to the 
length of the cycle which is evident in the data. The cyclical 
fluctuations will thus be smoothed out and a better measure of 
trend obtained. 

Table 14 illustrates the application of such a moving average to 
the consumption of cigarettes in the United States. This table 
demonstrates the procedure to be followed in fitting a moving 
average consisting of both an odd number of items and an even 
number of items. The method for the odd period moving average 
consists of obtaining successive totals (in this case 7 items) by 
consecutively dropping the first item and adding the next in the 
series. The total for the first seven items in the illustration 
(table 14) equals 239.84. This sum is then placed next to the middle 
item of the group (1917). The second figure in column 3 
(273.85) is obtained by dropping the figure for 1914 and adding the 
figure for 1921. This process is continued until all the figures in 
column 2 have been included. Each of the figures in column 3 is 
then divided by 7 to obtain the seven year moving average entered 
in column 4. 


Table 14 — Computation of Trend — Moving Average Method 
Consumption of Cigarettes in the United States, 1914-1932 

 (1)           (2)                (3)                  (4) 
           Consumption                              Seven year 
          (Billions of         Seven year         Moving Average 
Year       Cigarettes)        Moving Total     (Col. 3 divided by 7) 
1914          16.86 
1915          17.96 
1916          25.29 
1917          35.33              239.84               34.26 
1918          46.66              273.85               39.12 
1919          53.12              309.46               44.21 
1920          44.62              348.62               49.80 
1921          50.87              383.30               54.76 
1922          53.57              416.60               59.51 
1923          64.45              452.93               64.70 
1924          70.01              505.49               72.21 
1925          79.96              560.54               80.08 
1926          89.45              626.01               89.43 
1927          97.18              681.18               97.31 
1928         105.92              724.62              103.52 
1929         119.04              748.24              106.89 
1930         119.62 
1931         113.45 
1932         103.58 

Source: United States Department of Commerce. 

When an even number of items is included in the moving
average (as 6 years in table 15 below) the center point of the
group will be between two years. It is, therefore, necessary to
adjust or shift (known technically as center) these averages so
that they coincide with the years. Columns 3 and 4 (six year
moving total and six year moving average) may be obtained by
the methods outlined for the odd period average as explained
above. To center the values a two-item moving average is taken
of the even item moving average.

A two year moving average is taken of the six year average.
The resulting average is located between the two six year moving
average values and therefore coincides with the years. The end
result (a two year moving average of a six year moving average) is
known as the six year moving average centered.
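
The centering step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `centered_moving_average` is hypothetical, and the computation keeps unrounded six year averages, so a figure may occasionally differ by 0.01 from a table worked by hand with rounded intermediate values.

```python
# Cigarette consumption, 1914-1932, in billions of cigarettes.
consumption = [16.86, 17.96, 25.29, 35.33, 46.66, 53.12, 44.62, 50.87,
               53.57, 64.45, 70.01, 79.96, 89.45, 97.18, 105.92, 119.04,
               119.62, 113.45, 103.58]
first_year = 1914


def centered_moving_average(values, k):
    """Centered moving average for an even period k."""
    # k-item averages; each falls between two items of the series.
    averages = [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]
    # A two-item average of neighbouring values falls back on an item.
    return [(averages[i] + averages[i + 1]) / 2
            for i in range(len(averages) - 1)]


k = 6
centered = centered_moving_average(consumption, k)
# The first centered value coincides with the item k // 2 places from the start.
years = [first_year + k // 2 + i for i in range(len(centered))]

for year, value in zip(years, centered):
    print(year, round(value, 2))
```

For the six year period the first centered value (34.85) falls on 1917 and the last (108.62) on 1929.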

Table 15 — Computation of Trend — Moving Average Method (Even Year Period)
Consumption of Cigarettes in the United States, 1914-1932

  (1)        (2)           (3)         (4)          (5)            (6)
          Consumption    Six-Year    Six-Year     Two-Year       Six-Year
          (Billions of    Moving      Moving    Moving Total  Moving Average
  Year    Cigarettes)     Total      Average     of Col. 4       Centered

  1914       16.86
  1915       17.96
  1916       25.29
                          195.22      32.54
  1917       35.33                                 69.70          34.85
                          222.98      37.16
  1918       46.66                                 79.81          39.91
                          255.89      42.65
  1919       53.12                                 90.01          45.01
                          284.17      47.36
  1920       44.62                                 99.58          49.79
                          313.29      52.22
  1921       50.87                                108.33          54.17
                          336.64      56.11
  1922       53.57                                116.69          58.35
                          363.48      60.58
  1923       64.45                                128.63          64.32
                          408.31      68.05
  1924       70.01                                143.82          71.91
                          454.62      75.77
  1925       79.96                                160.27          80.14
                          506.97      84.50
  1926       89.45                                178.09          89.05
                          561.56      93.59
  1927       97.18                                195.45          97.73
                          611.17     101.86
  1928      105.92                                209.30         104.65
                          644.66     107.44
  1929      119.04                                217.24         108.62
                          658.79     109.80
  1930      119.62
  1931      113.45
  1932      103.58

Advantages

1. Only simple computations are involved.

2. It may replace the fitting of complex mathematical curves.

Disadvantages

1. It cannot be brought up to date. Depending upon the
number of items included, the last point in the trend must
occur several years before the end of the data. The last
point of a five year moving average falls on the third year
from the end of the data, that of a seven year average on
the fourth year from the end, etc.

2. The concept of trend involves the idea of a smooth growth 
or decline. The moving average is usually irregular in 
appearance. 

3. A moving average fitted where the trend is that of a concave 
(upturning) curve will be higher than the true trend at all 
points, and lower in the case of a convex trend. 

4. The moving average is computed by the use of the arithmetic
mean. This form of average is greatly affected by extreme 
values. Because of this fact the moving average will be 
pulled decidedly out of line by such unusual events as 
strikes, disasters, etc. 

5. The number of items giving the smoothest moving average 
is equal to the number of years included in the average 
length of the business cycle in the data. Since this average 
length must be estimated by the statistician the estimation 
will vary from person to person and, therefore, the method 
is not purely objective. 


ADDITIONAL BIBLIOGRAPHY* 

Sutcliffe, William G., Statistics for the Business Man, pp.
196-200. Harper & Bros., New York, 1930.

* For readings in standard Statistics textbooks see the QUICK REFERENCE TABLE TO
STANDARD TEXTBOOKS following Table of Contents.

TIME SERIES ANALYSIS — TREND
THE LEAST SQUARES METHOD — LINEAR

Formulae for Straight Lines

If a line is drawn on a graph its formula may be read by
inspection.

The formula for line d in figure 16 is determined by obtaining
the values of X and Y as indicated by the line itself.



Fig. 16 

For Line d


When X equals      Y equals

0 0 

1 1 

2 2 

3 3 


The formula for line d is, therefore,

Y = X

The formulae for lines e and f are obtained in exactly the same
manner, viz.:

Line e      Y = 2X

Line f      Y = 3X

The value of the coefficient of X (the number or constant which
precedes X in the equation, as 3 in Y = 3X) indicates the number
of units the value of Y will increase for each unit increase in X.
For reference purposes this constant is given the letter b. The
equation may now be written more generally as

Y = bX

The greater the value of the constant b the more rapid the rise
in the line. The value of b therefore is the measure of the slope
of the line. If the line trends downward, indicating a decrease
in Y for each unit increase in X, the b value will be negative.


The Y Intercept

If the values of X and Y for line g in figure 17 are obtained
the result will be:

For Line g

When X equals      Y equals

0 1

1 2

2 3

3 4

4 5


There is a unit increase in the value of Y for each unit increase
in X. For this reason the value of b will be 1. It should be
noted, however, that the Y value is constantly one unit greater
than the X value, making the formula for this line of trend:

Y = 1 + 1X

The formula for line h is:

Y = 1 + 2X

since, in addition to Y being 1 when X is 0, it can be seen that for
each unit increase in X there is a two unit increase in Y.

The value of the new constant is equal to the value indicated
on the Y or vertical axis at the point at which the line crosses it,
when X is equal to zero. This point is indicated by the arrow in
figure 17 for lines g and h. The new constant 1 is assigned the
letter a.


For Line j

When X equals      Y equals

0 2

1 4

2 6


The increase in Y again is two units for each unit increase
in X. The slope constant b is therefore equal to 2. However,
the line crosses the Y axis (when X equals zero) at 2 and as a
result the a constant is equal to 2. The equation is then
represented by:

Y = 2 + 2X

The constant a is sometimes called the Y intercept because the
value is determined by the point at which the line crosses the Y
axis (when X equals zero), while b indicates the slope of the line.

The formula for any straight line may be written generally as

Y = a + bX

The Least Squares Method 

If a straight line trend is assumed, the line of trend will have a
formula of the type:

Y = a + bX

In this formula the values of a and b must be determined.
The formula Y = a + bX will, however, describe any one of an
infinite number of lines. It is necessary therefore to decide
which line best describes the data. The principle of least squares
aids in determining the line that best describes the trend of the
data. The principle states that a line of best fit to a series of
values is a line the sum of the squares of the deviations (the
differences between the line and the actual values) about which
will be a minimum. There can be only one line having this
qualification.

Least Squares Line

The least squares line for a given series may be obtained by the
use of a set of "normal" equations. These "normal" equations
are derived mathematically (see technical appendix IV) but for
working purposes they may be obtained by multiplying the "type"
equation, in this case¹

Y = a + bX

through by the coefficients of each unknown (a and b). The
coefficient of the first unknown (a) is 1. Multiplying the type
equation through by 1 we have:

Y = a + bX

The formula must be summed up for all points. The summation
results in

(I) Σ(Y) = Σa + bΣ(X)

But the sum of a equals the number of items times the constant, viz.,

Σa = Na

since the sum of a number of constants will equal the constant
multiplied by the number of times (N) it appears. The result
may be written as:

(I) Σ(Y) = Na + bΣ(X)

The coefficient of the second unknown (b) is X. Multiplying
the type equation (Y = a + bX) through by X we obtain:

XY = aX + bX²

This sums up to:

(II) Σ(XY) = aΣ(X) + bΣ(X²)

By the use of these two equations the values of the two unknowns
may now be determined and the trend line fitted.

Application of the Least Squares Method 

The application of the least squares method for the determination
of the trend for the production of aluminum in the United
States is shown in table 16. The equation, since it is assumed
to be linear (straight line), must be of the type

Y = a + bX

from which the two "normal" equations are obtained.

(I) Σ(Y) = Na + bΣ(X)

(II) Σ(XY) = aΣ(X) + bΣ(X²)

In order to solve for a and b the following values are necessary:
Σ(X); Σ(Y); Σ(XY); Σ(X²); N

¹ The method outlined here is not a derivation but rather a short method for obtaining the
necessary "normal" equations.

Time is invariably placed on the X (horizontal) axis and for
this reason constitutes the X variable, with production as the Y
(vertical axis) variable.

The sum of X or Σ(X) is obtained by adding the numbers of
each year; for Σ(Y) the production figures for all years are
totaled.

The numbers of the years (as 1919, 1920, etc.) are inconvenient
for calculating purposes and, since the numbering system for the
years is merely arbitrary, the usual numbers are replaced with a
simpler series of consecutive numbers. The correspondence
between the two series must be noted by indicating the original
number of the year to which the number zero is now assigned.
The year to which the number 0 is arbitrarily assigned is known
as the origin year. The new numbering system is shown in
column 2 of the work sheet below.

Table 16 — Computation of Trend — Least Squares Method
Annual Production of Aluminum in the United States, 1916-1930

  (1)       (2)          (3)              (4)          (5)
                     Production of
                       Aluminum
                     (Millions of
                        Pounds)
  Year       X            Y               XY           X²

  1916       0          110.2              0.0           0
  1917       1          143.3            143.3           1
  1918       2          143.3            286.6           4
  1919       3          134.5            403.5           9
  1920       4          138.0            552.0          16
  1921       5           55.0            275.0          25
  1922       6           74.0            444.0          36
  1923       7          129.0            903.0          49
  1924       8          150.0           1200.0          64
  1925       9          140.0           1260.0          81
  1926      10          145.0           1450.0         100
  1927      11          160.0           1760.0         121
  1928      12          210.0           2520.0         144
  1929      13          225.0           2925.0         169
  1930      14          229.0           3206.0         196

         Σ(X) = 105   Σ(Y) = 2186.3   Σ(XY) = 17328.4   Σ(X²) = 1015

N = the number of years (15 in the illustrated problem) or items,
as the case might be.


Substituting the values obtained in the two normal equations:

(I) Σ(Y) = Na + bΣ(X)

(II) Σ(XY) = aΣ(X) + bΣ(X²)

(I) 2186.3 = 15a + 105b
(II) 17328.4 = 105a + 1015b

The equations may be solved simultaneously by obtaining equal
values for the coefficient of one unknown (a or b).

If equation (I) is multiplied by 7:

(I) 15304.1 = 105a + 735b
(II) 17328.4 = 105a + 1015b

Equation (I) subtracted from equation (II) gives

2024.3 = 280b

b = 2024.3 / 280 = 7.23

Substituting 7.23 for b in the original equation (I) gives:

2186.3 = 15a + 105(7.23)
2186.3 = 15a + 759.15
1427.15 = 15a
a = 95.14
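
As a check on the arithmetic, the two normal equations can be solved in Python. This is a minimal sketch only: the function name `fit_line` is an assumption, and the closed-form expressions used are simply what results from eliminating a between equations (I) and (II).

```python
def fit_line(xs, ys):
    """Solve the two normal equations for Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Eliminating a between (I) and (II) leaves one equation in b.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Equation (I) then gives a.
    a = (sum_y - b * sum_x) / n
    return a, b


# Aluminum production, 1916-1930, in millions of pounds; X = 0 in 1916.
production = [110.2, 143.3, 143.3, 134.5, 138.0, 55.0, 74.0, 129.0,
              150.0, 140.0, 145.0, 160.0, 210.0, 225.0, 229.0]
xs = list(range(len(production)))

a, b = fit_line(xs, production)
print(round(a, 3), round(b, 3))
```

Carried out without intermediate rounding, the solution gives b of about 7.230 and a of about 95.146; the figure 95.14 in the text arises because the rounded value 7.23 is substituted for b.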

Having obtained the values for a and b, the type equation can
now be written with numerical coefficients and the formula for
the line of trend written as:


Y = 95.14 + 7.23X

In interpreting the equation it is necessary to state the origin
year and the units used in the enumeration of the original values.

The equation as finally stated will then read:

Trend of Annual Production of Aluminum
in the United States 1916-1930

Y = 95.14 + 7.23X
Year of origin: 1916

Unit: in millions of pounds
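
Once the equation, its origin year and its unit are stated, a trend value for any year follows by substitution. The sketch below is illustrative only; the name `trend_value` is an assumption made for the example.

```python
def trend_value(a, b, x):
    """Trend value Y = a + bX for a given X."""
    return a + b * x


a, b, origin_year = 95.14, 7.23, 1916

# X for a year is its distance in years from the origin year.
trend = {year: round(trend_value(a, b, year - origin_year), 2)
         for year in range(1916, 1931)}

print(trend[1916])  # 95.14
print(trend[1918])  # 109.6
```

Since two points determine a straight line, any two of these values are enough to plot the line of trend.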


Graphic Presentation of Trends

To obtain the various trend values of Y (in order that the
trend line may be drawn on the graph) the various values of X
indicated for each year on the work sheet are substituted in the
equation. For 1918 the X value indicated in the work sheet is 2.
The X of the original equation is replaced by this value.

Y = 95.14 + (7.23)(2)

Y = 95.14 + 14.46

Y = 109.60 for the year 1918.

This process is repeated until the trend value for each year is
obtained. Since two points determine a straight line, in practice
all that is needed to plot the line are two values for two different
years.

Short Method for Computing Trend — Odd Number of Years

The procedure may be simplified when an odd number of years
is used for the trend computation. The middle year may be
taken as the origin year, thereby assigning to it an X value equal
to zero. A minus sign is generally given to the X values for the
years previous to the origin year and a plus sign to those following
the origin year.

Applying this technique to the problem worked out above it
will be found (see column 2 of table 17) that the sum of the X
values will be zero, since they will consist of two like arithmetic
progressions equal in amount but opposite in sign. This will
modify the "normal" equations:

(I) Σ(Y) = Na + bΣ(X)

(II) Σ(XY) = aΣ(X) + bΣ(X²)

since Σ(X) = 0,

aΣ(X) = 0
and bΣ(X) = 0

thus the normal equations are simplified to:

(I) Σ(Y) = Na

(II) Σ(XY) = bΣ(X²)

and the need for a simultaneous solution is now eliminated.

Table 17 — Computation of Trend — Least Squares Straight Line — Odd Number
of Years — Short Method

Annual Production of Aluminum in the United States, 1916-1930

  (1)       (2)          (3)              (4)          (5)
                     Production of
                       Aluminum
                     (Millions of
                        Pounds)
  Year       X            Y               XY           X²

  1916      -7          110.2           -771.4          49
  1917      -6          143.3           -859.8          36
  1918      -5          143.3           -716.5          25
  1919      -4          134.5           -538.0          16
  1920      -3          138.0           -414.0           9
  1921      -2           55.0           -110.0           4
  1922      -1           74.0            -74.0           1
  1923       0          129.0              0.0           0
  1924       1          150.0            150.0           1
  1925       2          140.0            280.0           4
  1926       3          145.0            435.0           9
  1927       4          160.0            640.0          16
  1928       5          210.0           1050.0          25
  1929       6          225.0           1350.0          36
  1930       7          229.0           1603.0          49

         Σ(X) = 0   Σ(Y) = 2186.3   Σ(XY) = 2024.3   Σ(X²) = 280

Source: United States Bureau of Mines.

Substituting these values in the two simplified "normal" equations:

(I) Σ(Y) = Na

(II) Σ(XY) = bΣ(X²)

the result is:

(I) 2186.3 = 15a

a = 145.75

(II) 2024.3 = 280b

b = 7.23

The resulting equation is written as follows:

Trend of Annual Production of Aluminum
in the United States 1916-1930
Y = 145.75 + 7.23X
Origin: 1923

Unit: in millions of pounds.
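
The short method lends itself to a very small computation, since each constant comes from a single division. The following is a minimal sketch; the function name `fit_line_centered` is an assumption made for the example.

```python
def fit_line_centered(ys):
    """Short method for an odd number of items: origin at the middle item."""
    n = len(ys)
    if n % 2 == 0:
        raise ValueError("this short method needs an odd number of items")
    half = n // 2
    xs = list(range(-half, half + 1))   # -7, -6, ..., 0, ..., 6, 7 for 15 items
    a = sum(ys) / n                     # from (I):  sum(Y)  = N a
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)  # from (II)
    return a, b, xs


# Aluminum production, 1916-1930, in millions of pounds.
production = [110.2, 143.3, 143.3, 134.5, 138.0, 55.0, 74.0, 129.0,
              150.0, 140.0, 145.0, 160.0, 210.0, 225.0, 229.0]

a, b, xs = fit_line_centered(production)
print(round(a, 2), round(b, 2))  # 145.75 7.23
```

Here a is simply the arithmetic mean of the Y values, and the origin is the middle year of the series (1923).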

Since a represents the value of Y when X equals zero, when
a different zero point is used for X the a value is accordingly
changed. As an example, the difference in the a value in the two
trend equations (i.e., those obtained respectively by the long and
short method) is due to the different points of origin. In the short
method 1923 was taken as the origin year, in the long method
1916. The value of X was zero in 1923 for the short method, and
zero in 1916 in the long method.

Fig. 18 — Annual Production of Aluminum in the United States 1916-
1930. Trend indicated by a "least squares" straight line.

The equations can be shown to give the same results by using
the corresponding X values for any year in both formulae.

If the production trend figure for 1923 is desired, substitute in
the formula obtained by the long method the X value indicated
for 1923 in table 16 (X = 7).

Y = 95.14 + 7.23X     Origin 1916
(in millions of pounds)

with the result (for 1923):

Y = 95.14 + 7.23(7)

Y = 145.75

From table 17 for 1923 let X = 0 and substitute it in the formula
as obtained by the simplified method:

Y = 145.75 + 7.23X     Origin 1923
(in millions of pounds)

and the result is (for 1923):

Y = 145.75 + 7.23(0)

Y = 145.75

Thus the two trend values correspond.

Shifting the Origin

A change in origin is merely a change in the starting point for
the computation of the trend figure.

The b value (the measure of slope) is not affected by a shift in
origin because no matter where the starting point be taken the
line of trend will always have the same slope. In both equations
in the previous paragraph the b value was 7.23 in spite of the
differing origins (1916 and 1923).

The a value indicates the value of Y when X equals zero. The
following method may be used to change the point of origin of
the first equation (1916 origin) to a 1923 origin:

Y = 95.14 + 7.23X
Origin 1916

(in millions of pounds)

In the new equation the origin is relocated at a point 7 years
beyond the prior origin. By substituting 7 for X the value for Y
must be equal to 145.75.

Since the slope is unaffected by a change in origin, b = 7.23.
Substituting in the type equation:

Y = a + bX
145.75 = a + 7.23X

But since X = 0 in 1923,

a = 145.75

Summary of Steps for Shifting Origin

1: Use original equation                Y = 95.14 + 7.23X (origin 1916)
2: Substitute value for X for           X = 7, since the 1923 trend value
   new origin year                      is desired
3: Obtain Y (trend) value               Y = 145.75
4: Replace old a value with             Y = 145.75 + 7.23X (origin 1923)
   this new amount
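
The four steps summarized above amount to a single substitution, which can be sketched as follows. The function name `shift_origin` is an assumption made for this illustration.

```python
def shift_origin(a, b, years_forward):
    """Return (a, b) of the same trend line referred to a new origin.

    years_forward is the X value of the new origin year in the old
    equation; it is negative when the origin is moved back in time.
    """
    new_a = a + b * years_forward   # trend value at the new origin
    return new_a, b                 # the slope is unaffected


new_a, new_b = shift_origin(95.14, 7.23, 1923 - 1916)
print(round(new_a, 2), new_b)  # 145.75 7.23
```

Shifting back again with `shift_origin(145.75, 7.23, 1916 - 1923)` recovers the 1916 equation.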


Short Method — Even Number of Years 

In applying the short method to a time series with an even
number of years the fact that there is no middle year presents a
difficulty. However, the middle point of the series may be used
by assigning an origin between the two center years (as January 1,
1924).¹ The first year on either side of the origin will now be
one half year away from the point of origin and as a consequence
will be assigned as numbers + .5 or − .5, as the case may be.
The second year on either side will be numbered + 1.5 and − 1.5.
This numbering system is shown in table 18, column 3.

Since decimals or fractions are cumbersome the trend equation
may be obtained more readily by working in terms of half years,
thus eliminating decimal values (see column 4).

Table 18 — Computation of Trend — Least Squares Straight Line Short Method
— Even Year Period

Trend Expansion of F. W. Woolworth Co., 1916-1931

  (1)        (2)           (3)           (4)
              Y
           Average          X             X'
          Number of                    (In Half
  Years     Stores      (In Years)      Years)        X'Y         X'²

  1916        920          -7.5          -15        -13800        225
  1917       1000          -6.5          -13        -13000        169
  1918       1039          -5.5          -11        -11429        121
  1919       1081          -4.5           -9         -9729         81
  1920       1111          -3.5           -7         -7777         49
  1921       1137          -2.5           -5         -5685         25
  1922       1176          -1.5           -3         -3528          9
  1923       1261           -.5           -1         -1261          1
  1924       1364            .5            1          1364          1
  1925       1420           1.5            3          4260          9
  1926       1484           2.5            5          7420         25
  1927       1588           3.5            7         11116         49
  1928       1727           4.5            9         15543         81
  1929       1828           5.5           11         20108        121
  1930       1890           6.5           13         24570        169
  1931       1896           7.5           15         28440        225

         Σ(Y) = 21922   Σ(X) = 0   Σ(X') = 0   Σ(X'Y) = 46612   Σ(X'²) = 1360

Source: United States Department of Commerce; Survey of Current
Business.


¹ When a figure for a year is indicated in a time series it represents the figure as of the middle
of the year, or July 1 of that year.


These values may now be substituted in the two simplified normal
equations:

(I) Σ(Y) = Na

(II) Σ(X'Y) = bΣ(X'²)

Resulting in:

(I) 21922 = 16a

a = 1370.13

(II) 46612 = 1360b

b = 34.27

The equation then reads:


Trend of Number of Stores in Woolworth Chain 1916-1931
Y = 1370.13 + 34.27X'

Origin: January 1, 1924
Unit: Number of stores
X' in ½ years

The equation in its present form is difficult to handle. For
this reason it should be converted to the standard form which has
as its origin July 1 of the year.

To shift the origin to the center of 1924 (July 1) it is necessary
to obtain the trend value (Y) at the new point of origin and use
that value for a.

The trend value for ½ year later may be obtained by substituting
+ 1 for X' in the equation (since X' is in terms of half years).


Y = 1370.13 + 34.27 × 1

Y = 1404.40


Y may now be used as the new a value and the equation written:

Y = 1404.40 + 34.27X'

Origin: 1924 (July 1)

X' in ½ years

Finally, an adjustment must be made to convert X' (in half years)
to X (in years).

The symbol b used as the coefficient of X' represents the increase
per half year (in this case 34.27 stores). To obtain the increase
per year b is doubled (68.54). The equation is now written:

Trend of Number of Stores in Woolworth Chain, 1916-1931
Y = 1404.40 + 68.54X
Origin: 1924

Unit: Number of stores.
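
The whole even-year procedure (half-year units, shift of origin to July 1, conversion to yearly units) can be sketched as below. The function name `fit_line_even` is an assumption made for this illustration, and no intermediate rounding is applied, so the last decimal may differ slightly from the hand computation.

```python
def fit_line_even(ys):
    """Short method for an even number of items, X' counted in half years."""
    n = len(ys)
    if n % 2 != 0:
        raise ValueError("this short method needs an even number of items")
    xs_half = list(range(-(n - 1), n, 2))   # -15, -13, ..., -1, 1, ..., 13, 15
    a = sum(ys) / n
    b_half = (sum(x * y for x, y in zip(xs_half, ys))
              / sum(x * x for x in xs_half))
    return a, b_half


# Average number of Woolworth stores, 1916-1931.
stores = [920, 1000, 1039, 1081, 1111, 1137, 1176, 1261,
          1364, 1420, 1484, 1588, 1727, 1828, 1890, 1896]

a, b_half = fit_line_even(stores)   # origin January 1, 1924
a_july = a + b_half * 1             # move the origin one half year, to July 1, 1924
b_year = 2 * b_half                 # increase per year rather than per half year

print(round(a, 2), round(b_half, 2))
print(round(a_july, 2), round(b_year, 2))
```

Doubling the unrounded half-year slope gives about 68.55 stores per year; the text's 68.54 results from doubling the rounded value 34.27.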

Least Squares Method

Advantages

1. The method expresses trend in the form of a mathematical
formula which may be easily interpreted.

2. Results obtained under the method are definite and
independent of any subjective estimate on the part of the
statistician.

3. The resulting equation is in convenient form for extrapolation
(extension into future or past).

Disadvantages

1. The technique used is mathematical.

2. The method is based on the assumption that the data follow
a trend that can be expressed by a mathematical equation.

ADDITIONAL BIBLIOGRAPHY*

Fisher, R. A., Statistical Methods for Research Workers, pp.
120-123. Oliver & Boyd, Edinburgh, 1932.

Odell, C. W., Educational Statistics, pp. 189-199. Century Co.,
New York, 1924.

Sutcliffe, William G., Statistics for the Business Man, pp.
201-216. Harper & Bros., New York, 1930.

* For readings in standard Statistics textbooks, see the QUICK REFERENCE TABLE TO
STANDARD TEXTBOOKS following Table of Contents.

TIME SERIES ANALYSIS — TREND
NON-LINEAR TRENDS

The straight line trend does not satisfactorily describe the
trend of data which have a varying rate of growth. For example,
in the illustration below only a curved line can accurately describe
the trend of the data. The trend of gasoline exports is shown in
figure 19.

Fig. 19 — Gasoline Exports from the United States, 1915-1930.
(Vertical axis: millions of barrels; horizontal axis: years, 1915 to 1930.)

Source: United States Bureau of Mines.

Methods of Fitting Non-Linear Trends:

a: The Potential Series. The parabola is the simplest type of
curve used to describe the trend of data. The formula for a curve
of the simplest parabolic type is:

Y = a + bX + cX²

The fitting of a curve of this type follows closely the method
of fitting a linear equation as explained in the previous chapter.
Here, however, there are three unknowns ("a", "b" and "c") in the
equation, necessitating three normal equations for its solution.¹

Methods

1. Write down the "type" equation:

Y = a + bX + cX²

2. Multiply each term of the equation by the coefficient of
each unknown and sum up:

a: For normal equation one

Multiply the type equation by the coefficient of "a"
(which is 1) and sum up

Equation (I) Σ(Y) = Na + bΣ(X) + cΣ(X²)

b: For normal equation two

Multiply the type equation by the coefficient of "b"
(which is X) and sum up

Equation (II) Σ(XY) = aΣ(X) + bΣ(X²) + cΣ(X³)

c: For normal equation three

Multiply the type equation by the coefficient of "c"
(which is X²) and sum up

Equation (III) Σ(X²Y) = aΣ(X²) + bΣ(X³) + cΣ(X⁴)

Table 19 — Computation of Trend — Least Squares Method — Second Degree
Parabola

Gasoline Exports from the United States, 1915-1930

  (1)     (2)      (3)        (4)      (5)       (6)        (7)       (8)
                 Exports
                (Millions
               of Barrels)
  Years    X        Y          XY       X²       X²Y         X³        X⁴

  1915     0       2.7         0.0       0        0.0         0         0
  1916     1       8.5         8.5       1        8.5         1         1
  1917     2       9.9        19.8       4       39.6         8        16
  1918     3      13.3        39.9       9      119.7        27        81
  1919     4       8.9        35.6      16      142.4        64       256
  1920     5      15.3        76.5      25      382.5       125       625
  1921     6      12.7        76.2      36      457.2       216      1296
  1922     7      13.8        96.6      49      676.2       343      2401
  1923     8      20.1       160.8      64     1286.4       512      4096
  1924     9      28.3       254.7      81     2292.3       729      6561
  1925    10      30.6       306.0     100     3060.0      1000     10000
  1926    11      42.5       467.5     121     5142.5      1331     14641
  1927    12      44.3       531.6     144     6379.2      1728     20736
  1928    13      52.9       687.7     169     8940.1      2197     28561
  1929    14      62.1       869.4     196    12171.6      2744     38416
  1930    15      65.6       984.0     225    14760.0      3375     50625

  Total  120     431.5      4614.8    1240    55858.2     14400    178312

Source: United States Bureau of Mines.


¹ In order to solve for a given number of unknowns shown in an equation it is necessary
to have the same number of equations involving these unknowns.

The resulting normal equations are:

Equation (I) Σ(Y) = Na + bΣ(X) + cΣ(X²)

Equation (II) Σ(XY) = aΣ(X) + bΣ(X²) + cΣ(X³)

Equation (III) Σ(X²Y) = aΣ(X²) + bΣ(X³) + cΣ(X⁴)

From the values of X and Y,
Σ(X), Σ(Y), Σ(XY), Σ(X²), Σ(X³), Σ(X⁴), Σ(X²Y) and
N can now be determined (as outlined in the previous
chapter) and substituted in the normal equations shown
above. A simultaneous solution is then used to obtain
the desired values for the unknown constants.*

The fitting procedure is outlined in table 19.

(I) Σ(Y) = Na + bΣ(X) + cΣ(X²)

431.5 = 16a + 120b + 1240c

(II) Σ(XY) = aΣ(X) + bΣ(X²) + cΣ(X³)

4614.8 = 120a + 1240b + 14400c

(III) Σ(X²Y) = aΣ(X²) + bΣ(X³) + cΣ(X⁴)

55858.2 = 1240a + 14400b + 178312c

Solving equations I and II to eliminate "a":

(II)   4614.8   = 120a + 1240b + 14400c    (Equation II)

(I)    3236.25  = 120a +  900b +  9300c    (Equation I times 7.5)

(IV)   1378.55  =         340b +  5100c    (Equation II minus Equation I times 7.5)

Solving equations I and III to eliminate "a":

(III) 55858.2   = 1240a + 14400b + 178312c    (Equation III)

(I)   33441.25  = 1240a +  9300b +  96100c    (Equation I times 77.5)

(V)   22416.95  =          5100b +  82212c    (Equation III minus Equation I times 77.5)

Solving equations IV and V to eliminate "b":

(V)   22416.95  = 5100b + 82212c

(IV)  20678.25  = 5100b + 76500c    (Equation IV times 15)

       1738.70  = 5712c

c = .3044

Substituting the value of c in equation IV:

1378.55 = 340b + 5100(.3044)

1378.55 = 340b + 1552.44

b = − .5114

Substituting the values for b and c in equation II:

4614.8 = 120a + 1240(− .5114) + 14400(.3044)

120a = 865.576

a = 7.2131
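
The three normal equations can also be formed and solved mechanically. The sketch below is an illustration only: the function names `solve_linear_system` and `fit_parabola` are assumptions, and ordinary Gaussian elimination stands in for the step-by-step elimination shown above.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Use the row with the largest entry in this column as pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_parabola(xs, ys):
    """Form and solve the normal equations for Y = a + bX + cX^2."""
    n = len(xs)

    def sx(power):
        return sum(x ** power for x in xs)

    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2y = sum(x * x * y for x, y in zip(xs, ys))
    matrix = [[n,     sx(1), sx(2)],    # Equation (I)
              [sx(1), sx(2), sx(3)],    # Equation (II)
              [sx(2), sx(3), sx(4)]]    # Equation (III)
    return solve_linear_system(matrix, [sum_y, sum_xy, sum_x2y])


# Gasoline exports, 1915-1930, in millions of barrels; X = 0 in 1915.
exports = [2.7, 8.5, 9.9, 13.3, 8.9, 15.3, 12.7, 13.8,
           20.1, 28.3, 30.6, 42.5, 44.3, 52.9, 62.1, 65.6]
xs = list(range(len(exports)))

a, b, c = fit_parabola(xs, exports)
trend_1925 = a + b * 10 + c * 10 ** 2   # X = 10 for 1925
print(round(a, 4), round(b, 4), round(c, 4), round(trend_1925, 2))
```

The constants agree with the hand solution to about three decimal places; small differences in the fourth decimal come from the rounding of c and b during the hand computation.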

* For review of the method of simultaneous solutions where there are more than two
unknowns see any standard textbook on Elementary Algebra, as Harding, A. M. & Mullins,
G. W., College Algebra.

The final trend equation then reads:

Trend of Gasoline Exports from the United
States, 1915-1930
Y = 7.2131 − .5114X + .3044X²
Origin: 1915

Unit: Millions of Barrels

The trend values for the various years may now be obtained by
substituting the appropriate values of X (as indicated in column 2,
table 19). Thus for 1925, substituting 10 for X:

Y = 7.2131 − .5114(10) + .3044(10)²

Y = 7.2131 − 5.114 + 30.44

Y = 32.5391 millions of barrels

The method can be further simplified when the data consists of
an odd number of items. The middle year is selected as the year
of origin. Therefore, Σ(X) and Σ(X³) will then be equal to zero.
The normal equations are:

Equation (I) Σ(Y) = Na + bΣ(X) + cΣ(X²)

Equation (II) Σ(XY) = aΣ(X) + bΣ(X²) + cΣ(X³)

Equation (III) Σ(X²Y) = aΣ(X²) + bΣ(X³) + cΣ(X⁴)

However, since both Σ(X) and Σ(X³) = 0 the normal equations
are reduced to:

Equation (I) Σ(Y) = Na + cΣ(X²)

Equation (II) Σ(XY) = bΣ(X²)

Equation (III) Σ(X²Y) = aΣ(X²) + cΣ(X⁴)

The values of a and c are then obtained simultaneously in the
usual way, while b is obtained directly.

A more flexible curve than the second degree parabola may be
obtained by using a parabola of a higher degree.¹

The third degree parabola has the formula

Y = a + bX + cX² + dX³

The general formula for this type of curve is

Y = a + bX + cX² + dX³ + eX⁴ . . . etc.

These more elaborate forms of equations will generally tend to
follow the data more closely but must be used with care if they are
to describe the trend of the figures rather than the cyclical or
seasonal movement.

The solutions for parabolas of a higher degree (dX³, eX⁴, etc.)
may be arrived at in the same manner. The normal equations for
the more complex formulas are obtained as before from the type
equation and the values of the various unknown coefficients
a, b, c, d, etc. secured through the method previously described.

¹ The "degree" of an equation corresponds to the largest exponent in the equation.

b: Exponential Series. Occasionally neither the straight line
nor the parabola will be appropriate for describing the trend of a
particular series. This occurs, for instance, where the trend is
geometric in nature. One curve descriptive of the geometric type
of trend has the formula:

Y = ab^X

This type of trend appears when the Y values tend to form a
geometric progression¹ (such as the series 1, 2, 4, 8, 16, etc.) and
the X values are arranged in the form of an arithmetic progression
such as the series 2, 3, 4, etc. If plotted on semi-log paper (with
logarithmic ruling on the "Y" axis) a linear type of trend will make
its appearance, and the resulting curve is therefore referred to
as the semi-logarithmic curve. (See Chapter XVIII, p. 159.)

If a geometric progression is formed by the "Y" values when the
"X" values are arranged geometrically the formula is:

Y = aX^b

This type of curve will be linear on logarithmic paper (logarithmic
rulings on both "X" and "Y" axes).

The formulae of the exponential type such as those above may
be fitted readily by reducing them to logarithmic form.

Formula Y = ab^X

reduced to logarithms reads

log Y = log a + X log b

The normal equations may then be obtained as above.
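
Fitting by logarithms can be sketched as follows: the straight line method is applied to log Y in place of Y, and the antilogarithms of the two constants give a and b. The function name `fit_exponential` and the small illustrative series are assumptions made for this example; the method requires all Y values to be positive.

```python
import math


def fit_exponential(xs, ys):
    """Fit Y = a * b**X by least squares on log Y = log a + X log b."""
    logs = [math.log10(y) for y in ys]   # every Y must be positive
    n = len(xs)
    sum_x = sum(xs)
    sum_l = sum(logs)
    sum_xl = sum(x * l for x, l in zip(xs, logs))
    sum_x2 = sum(x * x for x in xs)
    # Normal equations of the straight line, with log Y in place of Y.
    log_b = (n * sum_xl - sum_x * sum_l) / (n * sum_x2 - sum_x ** 2)
    log_a = (sum_l - log_b * sum_x) / n
    return 10 ** log_a, 10 ** log_b


# A geometric progression in Y against an arithmetic progression in X.
xs = [0, 1, 2, 3, 4]
ys = [1, 2, 4, 8, 16]
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))  # 1.0 2.0
```

Because the least squares principle is applied to the logarithms, the fitted curve minimizes the squared deviations of log Y rather than of Y itself.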

There are a number of special exponential curves of some
importance for trend purposes. One of the more important curves
of this type is known as the Gompertz curve. The formula is:

Y = ab^(c^X)

ADDITIONAL BIBLIOGRAPHY*

Fisher, R. A., Statistical Methods for Research Workers, pp.
133-150. Oliver & Boyd, Edinburgh, 1932.

¹ A geometric progression is a series in which the values increase at a constant ratio.

* For readings in standard Statistics textbooks, see the QUICK REFERENCE TABLE TO
STANDARD TEXTBOOKS following Table of Contents.



CHAPTER VIII 

TIME SERIES ANALYSIS 
SEASONAL AND CYCLICAL ANALYSIS 


Seasonal Variation

Seasonal variation is the technical term given the more or less 
regular movements within the year recurring periodically year 
after year. 

Each month has a typical position in relation to the rest of 
the year. The problem of seasonal variation is to determine this 
typical or average position of each month. 


Methods of Measuring Seasonal Variation

The most generally used methods for measuring the seasonal
variation occurring within a time series are the:

1. Simple average method. 

2. Link Relative method. 

3. Ratio to moving average method. 

4. Ratio to trend method. 


Table 20 - Average Weekly Freight Car Loadings in U. S., 1919-1930 
(Unit: 1000 Cars) 

Year        Jan.   Feb.  March   Apr.    May   June   July   Aug.  Sept.   Oct.   Nov.   Dec.   Aver. 
1919         728    687    697    715    759    809      ?    892    960    967    807    758   803.8 
1920         820    770    848    731    862    860    901    968    969   1005    864    723   862.2 
1921         701    683    692    706    757    765    751    810    841    929    761    683   756.9 
1922         702    761    826    723    787    842    825    877    935    992    944    838   838.0 
1923         845    842    917    941    975   1011    986   1041   1037   1078    978    826   956.4 
1924         858    908    916    875    895    906    894    974   1037   1091    975    847   931.3 
1925         921    901    924    941    968    989    980   1080   1074   1107   1024    888   983.0 
1926         923    919    969    958   1037   1026   1049   1104   1148   1205   1008    904  1026.0 
1927         946    956   1002    975   1024    999    979   1062   1097   1115    956    834   995.4 
1928         862    897      ?    931   1002    985    980   1058   1117   1175   1061    883   992.7 
1929         893    942    962    996   1051   1052   1038   1117   1135   1169    978    835  1014.0 
1930         837    876    883    912    914    930    895    938    931    950    798    680   878.7 
Total      10040  10156  10587  10408  11031  11176  11148  11921  12281  12783  11234   9699 
Averages   836.6  846.3  882.3  867.3  919.3  931.3  929.0  993.4 1023.4 1065.2  936.2  808.3 
Table 21 - Computation of Index of Seasonal Variation 
Simple Average Method 

Freight Car Loadings in the United States, 1919-1930 

(1)          (2)                (3)           (4)          (5) 
Month        Average            Trend         Corrected    Index of 
             for Month          Correction    Average      Seasonal 
             (from Table 20) 
January         836.6              0.0          836.6         .92 
February        846.3            - 1.4          844.9         .93 
March           882.3            - 2.8          879.5         .96 
April           867.3            - 4.2          863.1         .95 
May             919.3            - 5.6          913.7        1.00 
June            931.3            - 7.0          924.3        1.01 
July            929.0            - 8.4          920.6        1.01 
August          993.4            - 9.8          983.6        1.08 
September      1023.4            -11.2         1012.2        1.11 
October        1065.2            -12.6         1052.6        1.15 
November        936.2            -14.0          922.2        1.01 
December        808.3            -15.4          792.9         .87 
Total                                         10946.2 
Average                                         912.2 


The Simple Average Method 

1. Average (arithmetic mean) the values for each month for all 
the years (see table 20). The result is the typical value for each 
of the twelve months. 

2. Adjust for trend. Each of the averages just computed will 
be distorted by the secular trend of the data. If the trend is 
upward, December will be higher than it should be in relation to 
trend since it occurs later along the trend line. 

The increase per month due to trend may be determined by 
fitting a "least squares line" to the average monthly figures for each 
year and dividing the b value (slope) by 12. The resulting value 
will then represent the amount each monthly average is distorted 
due to trend as compared to the previous month. 

Thus to reduce the February average to the level of the first 
month, January, the amount of the trend increment may be 
subtracted from that average. To reduce March to the January 
level, it is necessary to subtract from it 2 times the trend 
increment, for April 3 times, etc. (see table 21, column 3). 

3. The resulting corrected averages may then be expressed as 
a percentage of their own average (912.2 in table 21). These 
values are known as the indices of seasonal variation. The figure 
of 92% for January means that the figure for January is typically 
8% below the average for the year. 
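
The three steps can be sketched in Python as below. This is an illustration under stated assumptions rather than the author's own procedure: the name `simple_average_index` is hypothetical, the monthly averages are those read from table 21, and the trend increment of 1.4 per month is the value implied by column 3 of that table.

```python
# Monthly averages read from table 21, column 2 (thousands of cars).
MONTHLY_AVERAGES = [836.6, 846.3, 882.3, 867.3, 919.3, 931.3,
                    929.0, 993.4, 1023.4, 1065.2, 936.2, 808.3]

# Trend increment per month implied by table 21, column 3
# (1.4 for February, 2.8 for March, and so on).
TREND_INCREMENT = 1.4


def simple_average_index(monthly_averages, trend_increment):
    """Return (corrected averages, seasonal indices) for twelve months."""
    # Month k (January = 0) is reduced by k times the trend increment.
    corrected = [avg - k * trend_increment
                 for k, avg in enumerate(monthly_averages)]
    base = sum(corrected) / len(corrected)
    indices = [value / base for value in corrected]
    return corrected, indices


corrected, indices = simple_average_index(MONTHLY_AVERAGES, TREND_INCREMENT)
for month, value, index in zip(range(1, 13), corrected, indices):
    print(f"{month:2d}  {value:7.1f}  {index:4.2f}")
```

The indices are ratios to the mean of the corrected averages, so they average to 1.00 by construction.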
Link Relative Method 

The first step in the link relative method is to express the 
value for each month as a percentage of the previous month. 
From the data for Bituminous Coal Production in the United 
States 1914-1929 the figure for January, 1914 (40.19) is divided 
into the figure for February, 1914 (35.47), the figure for February, 
1914 (35.47) is divided into that for March, 1914 (45.46), etc. The 
resulting percentage figures, as 88.26% for February, 128.16% 
for March, etc., are called link relatives (see table 22). 

Table 22 - Link Relatives - Bituminous Coal Production in the United States, 
1914-1929 

Year      Jan.    Feb.   March   April     May    June    July    Aug.   Sept.    Oct.    Nov.    Dec. 
1914               88.26  128.16   ?1.91  120.92  110.02  100.23  110.03  103.36   96.59   88.59  113.39 
1915     99.29    7?.00  108.46   94.25  10?.24  109.70  104.74  107.28  107.34  107.9?  101.22  102.39 
1916    101.70    97.00   96.99   70.73  115.37   97.27  100.98  112.04   98.59  106.44  100.27   98.15 
1917    10?.78    86.20  115.77   87.42  112.52   90.43   98.87  102.33   95.23  107.16   98.66   92.35 
1918         ?   103.67  109.89   95.70  109.57  101.39  107.49  100.25   92.87  102.10   83.94   91.53 
1919    10?.00    76.08  100.82   95.39  116.75   98.69  11?.23  100.41  110.55  118.65   33.23  195.90 
1920    133.?9    82.53  116.54   81.00  102.79  115.71   99.76  108.65  100.54  106.05   98.69  101.29 
1921     77.27    76.60   98.51   90.66  120.99  101.70   89.64  113.66  101.64  124.59   82.37   85.98 
1922    123.00   108.99  122.41   31.46  128.58  100.95   70.19  152.05  158.67  110.06  100.36  102.64 
1923    10?.01    84.01  111.00   90.91  107.26   98.72   99.21  108.29   94.58  106.42   87.27   92.82 
1924    127.33    90.08   87.29   73.70  106.08   97.46  106.01  107.71  117.98  114.23   87.97  109.90 
1925    111.61    75.0?   96.52   89.55  105.28  104.76  106.49  113.39  104.32  113.64   95.45  104.00 
1926    101.31    86.79   99.05   ?0.88   97.46  107.51  103.51  106.04  105.66  111.47  109.38   96.57 
1927     99.??    93.01  113.68   57.05  102.37  102.88   92.11  123.96  100.53  104.96   92.33  101.66 
1928    107.40    93.53  100.31   73.22  113.76   98.20  100.89  113.31  100.46  121.94   91.42   94.22 
1929    120.19    91.??   83.24   95.75  108.91   94.77  106.74  10?.?1  101.42  115.10   89.18  101.12 

The link relatives for each month (all the Januaries, etc.) are 
then averaged. The typical position of each month in relation 
to the previous month is thus obtained. For instance, June is 
typically 101.6% of May. Since an arithmetic mean of the link 
relatives would be distorted by unusual monthly values, the median 
is used as the averaging method, as it is less disturbed by 
extreme values. 

The median (or typical) link relatives show the relation of 
each month to the month before but not to the rest of the months. 
Therefore, it is necessary to establish a relationship between 
these various links or convert the link relatives into a series of 
chain relatives. This is accomplished by arbitrarily setting the 
value of January as 100%. The median link for February is 
87.5%, which indicates that February is typically that percent 
of January and therefore is 87.5% of 100%. The median link 
for March establishes its chain relative as 102.6% of the chain for 
February (87.5%) or 89.8%. This computation is continued by 
multiplying each of the median link relatives by the chain for the 
preceding month. This process is repeated for all of the twelve 
averages; and, in addition, for the second January median link 
which is the value indicated for the first January repeated (table 
23, column 3). 
Table 23 - Computation of Seasonal Variation (Link Relative Method) 

                 (1)              (2)           (3) 
Month            Median           Chain         Adjusted Chain 
                 Link Relative    Relatives     Relatives 
January             107.5           100.0          100.0 
February             87.5            87.5           86.9 
March               102.6            89.8           88.7 
April                87.2            78.3           76.6 
May                 109.2            85.5           83.8 
June                101.6            86.9           84.1 
July                102.3            88.9           85.6 
August              108.5            96.5           92.6 
September           101.5            97.9           93.4 
October             109.0           106.7          101.7 
November             91.9            98.1           92.5 
December            101.2            99.3           93.2 
January             107.5           106.7          100.0 


A discrepancy exists between the chain relative for the first 
January and that for the second. The difference is due to the 
trend increment which makes each succeeding January higher or 
lower than that of the preceding year. 

The difference between the two values (6.7) thus represents 
the trend increment. It is necessary to adjust the chain relatives 
for the effect of trend; therefore increasing multiples of 
one-twelfth of the discrepancy must be subtracted from each chain 
value, starting with 1/12 for February, 2/12 for March, etc. 
The chain relatives will then be reduced to the same level as 
January¹ (see table 23, column 3). 

The end result is an index of seasonal variation with a base of 
January. 
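
A minimal Python sketch of the link relative computation follows. The function names are hypothetical and introduced only for illustration; the input figures are the three 1914 production values quoted in the text and the median link relatives of table 23. The sketch carries full precision instead of rounding at each step, so its figures can differ from table 23 in the last decimal place.

```python
def link_relatives(values):
    """Each value as a percentage of the value before it."""
    return [100.0 * cur / prev for prev, cur in zip(values, values[1:])]


def chain_relatives(median_links):
    """Chain the median links, January first and January repeated last.

    median_links[0] is the January link (December to January); the
    first January is set arbitrarily at 100.
    """
    chain = [100.0]
    for link in median_links[1:]:
        chain.append(chain[-1] * link / 100.0)
    # Second January: the December chain carried forward by the January link.
    chain.append(chain[-1] * median_links[0] / 100.0)
    return chain


def adjust_for_trend(chain):
    """Subtract k/12 of the January discrepancy from the k-th month after January."""
    discrepancy = chain[-1] - chain[0]
    return [value - k * discrepancy / 12.0 for k, value in enumerate(chain)]


# January-March 1914 production figures quoted in the text.
first_links = link_relatives([40.19, 35.47, 45.46])

# Median link relatives from table 23, January through December.
MEDIAN_LINKS = [107.5, 87.5, 102.6, 87.2, 109.2, 101.6,
                102.3, 108.5, 101.5, 109.0, 91.9, 101.2]
chain = chain_relatives(MEDIAN_LINKS)
adjusted = adjust_for_trend(chain)
print([round(v, 2) for v in first_links])
print([round(v, 1) for v in adjusted])
```

After the adjustment the second January returns to 100, which is the check that the whole discrepancy has been distributed over the twelve months.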


Ratio to Moving Average Method 

1. The seasonal variations in the data are smoothed by means 
of a twelve month moving average. The differences between the 
actual values and this moving average are due to seasonal move- 
ments. 

2. The ratio of each value to the corresponding moving average 
values for each month is then obtained. 

3. The ratios are then averaged for each month of all the years, 
using either the mean or median for this purpose. 

4. The resulting averages are the indices of seasonal variation. 
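
The four steps above can be sketched as follows. This is a hedged illustration only: the function names and the three-year sample series are hypothetical, and the centering of the twelve-month average on a month (by averaging two adjacent windows) is an assumption of the sketch that the text does not spell out.

```python
def centered_moving_average(values, period=12):
    """Centered moving average; None where the window does not fit.

    For an even period the average of two adjacent windows is taken so
    that the result is centered on a month (an assumption of this sketch).
    """
    half = period // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        first = sum(values[i - half:i + half]) / period
        second = sum(values[i - half + 1:i + half + 1]) / period
        result[i] = (first + second) / 2.0
    return result


def seasonal_indices(values, period=12):
    """Average, month by month, the ratio of actual value to moving average."""
    averages = centered_moving_average(values, period)
    ratios = [[] for _ in range(period)]
    for i, (value, average) in enumerate(zip(values, averages)):
        if average is not None:
            ratios[i % period].append(value / average)
    return [sum(group) / len(group) for group in ratios]


# Hypothetical three-year series: a level of 100 times a fixed pattern.
PATTERN = [0.9, 0.8, 1.0, 1.0, 1.1, 1.1, 1.0, 1.0, 1.1, 1.2, 1.0, 0.8]
series = [100.0 * PATTERN[m % 12] for m in range(36)]
indices = seasonal_indices(series)
print([round(v, 2) for v in indices])
```

Because the sample series has no trend and no irregular movement, the moving average is level and the recovered indices reproduce the pattern that was built in.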

Ratio to Trend Method 

The ratio-to-trend method measures the seasonal variation and 
in addition the combined cyclical and residual variations. 

¹ The trend discrepancy may be distributed on various other bases (see Mills, F. C., 
Statistical Methods, p. 320). 
Method 

Express each actual monthly value as a percent of its 
corresponding trend value, as computed by the "least squares" method¹ 
(see table 24, column 4). 

Since trend is 100% for all of these values, if plotted graphically 
the resulting graph would be that of the original data expressed 
in percentage form with trend removed. 

The ratios are then averaged for each month over the entire 
period of years. If the arithmetic mean is used in averaging, it 
may be distorted by extreme values; these values are therefore 
excluded before averaging.² 

The extreme or unusual values may be located by means of a 
multiple frequency table. The multiple frequency table is a 
multi-column frequency distribution of the ratios (A /T) with 
one column for each month (illustrated in figure 20). 
Fig. 20 - Multiple Frequency Table of Ratios of Building Contracts 
Awarded to Trend Values for Each Month, 1919-1933. 


¹ The monthly trend values may be readily obtained by fitting the "least squares" line to 
the annual averages and dividing the "b" value by 12. Since the annual equation has its 
origin at the middle of the year (July 1) it will be necessary to shift the origin ½ month forward 
to center it on the month. This may be accomplished by adding ½ of the "b" value to "a". 

² If the median is used for the averaging process it may not be typical when there are few 
items in the group averaged. 
When the average is computed the unusual figures (indicated 
by circles in the multiple frequency table) are excluded from the 
average. 

The resulting average ratios of actual values to trend will show 
the typical relation of each month to trend or the seasonal indices 
(see table 25). For example, the value of 102% for March means 
that that month is typically 2% above the trend for the month. 

If the seasonal index for each month is subtracted from its 
respective ratio of actual to trend, the seasonal variation will be 
eliminated from the series, leaving as the only remaining 
fluctuations the combined cyclical and residual (random). 


Table 24 - Computation of Seasonal and Cyclical Variation - Ratio to Trend 
Method 

Roadbuilding Contracts Awarded for Concrete Highways and Streets in the 
United States, 1919-1933 

(Two Years Shown Only) 

(1)            (2)               (3)       (4)       (5)         (6) 
Year and       Contracts         Trend     Ratio     Index of    Cyclical 
Month          Awarded                               Seasonal    and 
               (Million                                          Residual 
               Square Yards) 
                  A                T        A/T        S          C + R 
1919 
January           .27             5.17      .05        .51        - .46 
February          .78             5.20      .15        .57        - .42 
March            2.37             5.23      .45       1.02        - .57 
April            5.01             5.26      .95       1.64        - .69 
May              9.4?             5.29     1.78       1.50          .28 
June             6.61             5.33     1.24       1.37        - .13 
July             5.75             5.36     1.07       1.18        - .11 
August           8.15             5.39     1.51       1.16          .35 
September        3.84             5.42      .71        .99        - .28 
October          2.79             5.45      .51        .80        - .29 
November         2.01             5.48      .37        .59        - .22 
December         3.11             5.52      .56        .67        - .11 
1920 
January          1.96             5.55      .35        .51        - .16 
February         4.22             5.58      .76        .57          .19 
March            6.25             5.61     1.11       1.02          .09 
April            5.79             5.64     1.03       1.64        - .61 
May              5.61             5.68      .99       1.50        - .51 
June             2.94             5.71      .51       1.37        - .86 
July             2.63             5.74      .46       1.18        - .72 
August           2.04             5.77      .35       1.16        - .81 
September        2.95             5.80      .51        .99        - .48 
October          1.45             5.83      .25        .80        - .55 
November         1.32             5.87      .22        .59        - .37 
December         2.01             5.90      .34        .67        - .33 

Source: Portland Cement Association. 
Table 25 - Computation of Seasonal Index - Ratio-to-Trend Method 

Month          Monthly*         Number of*      Monthly 
               Total            Months          Average 
               (1919-1933)      Used 
January           7.65             15             .51 
February          8.54             15             .57 
March            15.40             15            1.02 
April            23.02             14            1.64 
May              18.11             12            1.50 
June             19.29             14            1.37 
July             17.78             15            1.18 
August           17.39             15            1.16 
September        14.80             15            0.99 
October          12.08             15             .80 
November          8.87             15             .59 
December         10.04             15             .67 

* Does not include "extreme" months; see multiple frequency table. 


Ratio-to-Trend Method — Summary 

1. Fit a trend line to the data. 

2. Compute the ratio of each actual value to its respective 
trend (A/T). 

3. Average the ratios for each month, using the arithmetic 
mean for averaging. First, however, eliminate extreme 
values located by means of the multiple frequency table. 
The resulting figures are the indices of seasonal variation. 

4. Subtract the respective indices of seasonal variation from 
the ratios for each month. The resulting series represents 
the cyclical and residual fluctuations occurring in the series. 
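
The summary can be sketched in Python as below. The function names are hypothetical, the two sample months are January and February 1919 from table 24, and the seasonal indices are those of table 25; the exclusion of extreme months is represented only by an optional set of positions to skip, which stands in for the multiple frequency table.

```python
def ratio_to_trend(actual, trend):
    """Ratio of each actual value to its trend value (A/T)."""
    return [a / t for a, t in zip(actual, trend)]


def monthly_index(ratios, exclude=()):
    """Arithmetic mean of one month's ratios, skipping excluded positions."""
    kept = [r for i, r in enumerate(ratios) if i not in exclude]
    return sum(kept) / len(kept)


def cyclical_residual(ratios, indices):
    """Ratio minus the seasonal index of its month, as in table 24, column 6."""
    return [r - indices[i % 12] for i, r in enumerate(ratios)]


# January and February 1919 from table 24 (columns 2 and 3).
actual = [0.27, 0.78]
trend = [5.17, 5.20]
# Seasonal indices from table 25, January through December.
INDICES = [0.51, 0.57, 1.02, 1.64, 1.50, 1.37,
           1.18, 1.16, 0.99, 0.80, 0.59, 0.67]

ratios = [round(r, 2) for r in ratio_to_trend(actual, trend)]
c_plus_r = [round(v, 2) for v in cyclical_residual(ratios, INDICES)]
print(ratios, c_plus_r)
```

The sketch follows the text in removing the seasonal element by subtraction; the ratios are rounded to two places first, as in the table.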


For readings in standard Statistics textbooks, see the QUICK REFERENCE TABLE TO 
STANDARD TEXTBOOKS following Table of Contents. 



CHAPTER IX 
LINEAR CORRELATION 


It is often desirable to observe and measure the relationship 
which occurs between two or more statistical series. It may, for 
instance, be desirable to know whether there is a relationship 
between changes in the cost of living and changes in wages; the 
grades on an examination and the intelligence quotient of a group 
of students; the amount of electrical current passed through a 
solution and the amount of substance deposited by electro- 
chemical reaction; the length of time elapsed and the amount of 
academic material retained in memory after various intervals of 
time; and many other similarly associated (correlated) series. 

The relationship, or more accurately the association, between 
series may be established and measured by means of the correlation 
technique. 



(Horizontal axis: Wheat, Dollars per Bushel) 

Source: Wheat prices - Daily Trade Bulletin. 
Flour prices - Northwestern Miller. 

Fig. 21 - Scatter Diagram of Relation Between Wheat Prices 
and Flour Prices, by months, 1914-1933. 
The Scatter Diagram 

If two related (associated) series are plotted graphically with 
one variable placed on the X axis and the other on the Y axis, 
the result is known as the scatter diagram.¹ If there is a definite 
relationship between the associated variables, the points plotted 
on the chart will follow a definite line of movement or "path" 
as in figure 21. 

If the relationship were perfect it is obvious that for every given 
value on the X axis, there would always be indicated a certain 
value on the Y axis. In a situation like this all the points would 
coincide with a curve or line instead of forming a path across the 
face of the scatter diagram. In figure 22 the figures of a bond 
yield index are plotted against those for the bond price index. 
Inasmuch as one series was computed from the other the relation- 
ship, of course, is perfect. 



Source: Standard Trade and Securities Service, Standard Statistics Corporation. 

Fig. 22 — Scatter Diagram of Relationship Between Standard Statistics 
Index of Bond Yields and Index of Bond Prices. 


When the series are imperfectly associated no single definite value 
of Y will result when a given value of X is selected. In accordance 
with the more or less imperfect relationship the variation will 
cause the points to depart from the indicated line or curve, 
creating a scatter. If there is a high degree of association the 
scatter will be confined to a narrow "path." The less perfect the 
relationship between the two sets of data, the greater will be the 
departures from the indicated line or course. These departures 
are known as a "scatter." 

¹ The independent or causal variable is placed on the X axis while the dependent variable 
is placed on the Y axis. 

Line of Regression 

The trend or direction of this movement may be defined by 
means of a "least squares" line or a curve (see Chapter VI).¹ The 
resulting curve is known as the line of regression. If the trend 
of the data is linear (non-linear regressions are treated in chapter 
X), the resulting equation will be of the type 

\[ Y = a + bX \]

The values of a and b are obtained from the "normal" equations: 

\[ \text{(I)} \quad \Sigma(Y) = Na + b\,\Sigma(X) \]

\[ \text{(II)} \quad \Sigma(XY) = a\,\Sigma(X) + b\,\Sigma(X^{2}) \]

as in the case of the straight line trend (see p. 55). 

Standard Error of Estimate 

This equation is used to estimate a theoretical value of Y for 
a given value of X. If the relationship is not perfect the actual 
values will not coincide with the theoretical values, because of the 
scatter or variation about the line. If the scatter is definitely 
measured the variation may then be allowed for and a range 
established within which the values may be expected to fall. 

The measure used for this purpose, the standard error of 
estimate, is similar to the standard deviation. The standard 
deviation measures the variation or scatter about the arithmetic 
mean, while the standard error of estimate is a measure of the 
variation or scatter about the line of regression. 

The standard deviation is the average (quadratic mean) of the 
deviations about the arithmetic mean, while the standard error of 
estimate is the average (quadratic mean) of the deviations about 
the line of regression. 



\[ S_y = \sqrt{\frac{\Sigma(d^{2})}{N}} \]

Where 

S_y = standard error of estimate 
d = deviation of actual values (Y) from theoretical (Y_c), or (Y - Y_c) 
N = number of items 

¹ Other methods of fitting such curves may be used, generally with less useful results 
(compare the freehand method). 
The standard error of estimate may be used in the same manner 
as the standard deviation. One standard deviation measured off 
plus and minus about the arithmetic mean includes 68% of the 
cases; and one standard error of estimate will also include 68% 
of the cases when measured off plus and minus about the line of 
regression.¹ 

Number of              Percent of 
Standard Errors        Cases Included 

± .6745                50% 
± 1.0000               68% 
± 2.0000               95% 
± 3.0000               99.7% 


Table 26 - Computation of Coefficient of Correlation - Ungrouped Data 
Circulation and Minimum Line Rates for National Advertising in 30 Daily 
Newspapers in New England, 1933 

(1)       (2)          (3)         (4)       (5)        (6)      (7)           (8)        (9) 
News-     Circula-     Rate per    (XY)      X²         Y²       Theoret-      (Y - Yc)   d² 
papers    tion         Line (in                                  ical Re-      d 
          (Thou-       Cents)                                    gression 
          sands)       (Y)                                       Values 
          (X)                                                    Yc 
  1         166          33         5478      27556     1089       34          - 1         1 
  2         192          42         8064      36864     1764       38          + 4        16 
  3         301          57        17157      90601     3249       55          + 2         4 
  4         149          30         4470      22201      900       31          - 1         1 
  5         115          25         2875      13225      625       25            0         0 
  6         108          23         2484      11664      529       25          - 2         4 
  7         446          75        33450     198916     5625       78          - 3         9 
  8         381          65        24765     145161     4225       68          - 3         9 
  9         399          70        27930     159201     4900       71          - 1         1 
 10         158          32         5056      24964     1024       33          - 1         1 
 11         451          79        35629     203401     6241       79            0         0 
 12         133          27         3591      17689      729       29          - 2         4 
 13         108          22         2376      11664      484       25          - 3         9 
 14         154          30         4620      23716      900       32          - 2         4 
 15         231          47        10857      53361     2209       44          + 3         9 
 16         150          32         4800      22500     1024       31          + 1         1 
 17         403          70        28210     162409     4900       71          - 1         1 
 18         149          32         4768      22201     1024       31          + 1         1 
 19         343          65        22295     117649     4225       62          + 3         9 
 20         247          50        12350      61009     2500       47          + 3         9 
 21         117          25         2925      13689      625       26          - 1         1 
 22         231          47        10857      53361     2209       44          + 3         9 
 23         217          43         9331      47089     1849       42          + 1         1 
 24         196          42         8232      38416     1764       39          + 3         9 
 25         166          33         5478      27556     1089       34          - 1         1 
 26         124          25         3100      15376      625       27          - 2         4 
 27         182          35         6370      33124     1225       36          - 1         1 
 28         166          33         5478      27556     1089       34          - 1         1 
 29         112          28         3136      12544      784       26          + 2         4 
 30         177          35         6195      31329     1225       36          - 1         1 

Total      6468        1252       322227    1725088    60650                             125 

Source: Editor and Publisher, International Yearbook for 1933. 


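\[ \sigma_y = \sqrt{\frac{\Sigma(Y^{2})}{N} - \left(\frac{\Sigma(Y)}{N}\right)^{2}} = \sqrt{\frac{60650}{30} - \left(\frac{1252}{30}\right)^{2}} = 16.74 \]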


¹ It is assumed that there is a normal or approximately normal distribution of the values 
about the line of regression. For a more complete discussion of these percentage figures see 
chapter XI. 
The procedure for obtaining the "least squares" line of regression 
and the standard error of estimate for the data in table 26 is 
outlined below. 

\[ \text{(I)} \quad \Sigma(Y) = Na + b\,\Sigma(X) \]

\[ \text{(II)} \quad \Sigma(XY) = a\,\Sigma(X) + b\,\Sigma(X^{2}) \]

\[ \text{(I)} \quad 1252 = 30a + 6468b \]

\[ \text{(II)} \quad 322227 = 6468a + 1{,}725{,}088.0\,b \]

\[ \text{(I)} \quad 269931 = 6468a + 1{,}394{,}500.8\,b \quad \text{(Equation I times 215.6)} \]

\[ \text{Subtracting} \quad 52296 = 330{,}587\,b \]

\[ b = .1582 \]

Substituting the value of b in Equation I 

\[ 1252 = 30a + 6468(.1582) \]

\[ 30a = 228.7624 \]

\[ a = 7.6254 \]

Line of Regression 

\[ Y_c = 7.6254 + .1582X \]

where 

Y_c = Minimum rate per line in cents. 

X = Circulation in thousands. 

The standard error may now be computed by first obtaining the 
theoretical regression values from this equation (column 7, 
table 26) and obtaining the difference between these and the 
actual values (column 8, table 26). 

For instance the regression value for paper #3 with a circulation 
(X) of 301 is obtained as follows:¹ 

\[ Y_c = 7.6254 + .1582(301) \]

\[ Y_c = 55\text{¢} \]

The standard error of estimate may then be computed from: 

\[ S_y = \sqrt{\frac{\Sigma(d^{2})}{N}} = \sqrt{\frac{125}{30}} = \sqrt{4.17} = 2.04\text{¢} \]


On the basis of the equation obtained: 

\[ Y = 7.6254 + .1582X \]

and with a standard error of estimate of 2.04¢, a new newspaper 
that reaches a circulation of 400,000 can be expected to have a 
minimum linage rate for national advertising of between 65¢ and 77¢ 
(3 standard errors, 99.7 chances out of 100). 95 papers out of 
100 (2 standard errors of estimate) with this circulation would 
have a minimum rate between 67¢ and 75¢. 

¹ The symbol Y_c is used to differentiate the theoretical regression value from the actual 
value. 
The range of values is obtained by substituting the circulation 
value for X: 

\[ Y_c = 7.6254 + .1582(400) \]

resulting in a theoretical value of 71¢ for the estimated linage rate. 
If 3 standard errors of estimate are then measured about this value 
99.7% of all papers can be expected to fall within this range, 
while if 2 standard errors of estimate are added to and subtracted 
from this value 95% of the newspapers would be included within 
this range. 
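
The computation of the line, the standard error and the range can be checked with a short Python sketch. The function names are hypothetical; the sums are the column totals of table 26. The sketch solves the two normal equations directly instead of by elimination, and it carries more decimal places than the hand computation, so a agrees with the text only to about two decimals.

```python
import math

# Column totals from table 26 (30 newspapers).
N = 30
SUM_X, SUM_Y = 6468, 1252
SUM_XY, SUM_XX = 322227, 1725088
SUM_D2 = 125  # sum of squared deviations, column 9


def solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_xx):
    """Solve  sum_y = n*a + b*sum_x  and  sum_xy = a*sum_x + b*sum_xx."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def estimate_range(a, b, x, std_error, k):
    """Regression value at x with k standard errors on either side."""
    center = a + b * x
    return center - k * std_error, center, center + k * std_error


a, b = solve_normal_equations(N, SUM_X, SUM_Y, SUM_XY, SUM_XX)
s_y = math.sqrt(SUM_D2 / N)
low, center, high = estimate_range(a, b, 400, s_y, 3)
print(round(a, 2), round(b, 4), round(s_y, 2))
print(round(low), round(center), round(high))
```

Calling `estimate_range(a, b, 400, s_y, 2)` gives the narrower range of two standard errors in the same way.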


Coefficient of Correlation 

The size of the standard error of estimate is a measure of the 
degree of association between series. The larger the value of the 
standard error of estimate the greater the scatter about the line 
of regression and, of course, the poorer the relationship. 

Standard errors cannot always be directly compared, however, 
because the standard error is expressed in terms of the original 
unit of the Y variable, and frequently the units are different; as, 
for instance, in the association of the intelligence quotient and 
examination grades compared with the association of the 
intelligence quotient and the number of words misspelled in a 
spelling test. 

Very often a more limited range is possible in one of two 
variables. For instance, if there are only 20 words on the spelling 
test and the number spelled correctly is taken as the score, the 
variation about the regression line is limited by this range as 
compared with the wider variation permitted by the limits of 
0 to 100% on an arithmetic test. If the standard error of estimate 
is divided by the standard deviation (of the Y values) the resulting 
value will be in percentage form. Both measures are in the same 
units, the factor of dispersion is made a constant, and thus both 
difficulties arising from differing dispersions and units are 
overcome. 


When the relationship is perfect there will be no deviations 
from the line of regression. Sy will then be equal to zero and 
result in a value of zero for the ratio. If the relationship is poor 
the value of the standard error will be larger, the limit of its 
value being that of the standard deviation. The ratio will thus 
attain as its other limit 1, or 100%. 

A perfect relationship is indicated by a ratio equal to zero and 
an imperfect relationship by a value of 100%. Since this inverts 
the usual manner of thinking in regard to such subjects a more 
readily understandable value can be obtained by subtracting the 


ratio from one: 

\[ 1 - \frac{S_y}{\sigma_y} \]
The new measure results in a value of 1.00 for a perfect 
relationship and a value of zero for a wholly imperfect relationship. 
A similar measure is termed the coefficient of correlation and is 
used as the comparative measure of association. The formula for 
the coefficient of correlation is 

\[ r = \sqrt{1 - \frac{S_y^{2}}{\sigma_y^{2}}} \]

The coefficient of correlation will have the same limits as the 
value outlined above; viz., zero to 100%. The value of r for the 
problem outlined in table 26 is 

\[ r = \sqrt{1 - \frac{(2.04)^{2}}{(16.74)^{2}}} = .9925 \]

Although the relationship be good or even perfect it may be 
inverse; that is to say, an increase in the value of X results in a 
corresponding decrease in the value of Y. Under these circum- 
stances the line of regression slopes downward. The value of b, 
the coefficient of slope (also called the coefficient of regression), 
in this equation is then negative. 

The sign of the coefficient of slope (or regression) is attached 
to r to indicate whether it is positive or negative. 


The problem of the measurement of association may be divided 
into three sections: 


1. The determination of the form of relationship — the line of 
regression. 

2. The measurement of variation about the established form of 
relationship — the standard error of estimate. 

3. The reduction of measurement of association to a relative 
basis — the coefficient of correlation. 


Product Moment Method — Ungrouped Data 

By the use of algebraic manipulation (see technical appendix) 
a much less arduous method for computing r, S_y and the line of 
regression¹ may be evolved from the fundamental formula for the 
coefficient of correlation: 

\[ r = \sqrt{1 - \frac{S_y^{2}}{\sigma_y^{2}}} \]

The formula for the simpler method is² 

\[ r = \frac{p}{\sigma_x \sigma_y} \]

¹ It is assumed that the regression line is linear. 

² For the derivation of the formula see technical appendix V. 
Where, for ungrouped data 

\[ p = \frac{\Sigma(XY)}{N} - \left(\frac{\Sigma(X)}{N}\right)\left(\frac{\Sigma(Y)}{N}\right) \]

\[ \sigma_x = \sqrt{\frac{\Sigma(X^{2})}{N} - \left(\frac{\Sigma(X)}{N}\right)^{2}} \]

\[ \sigma_y = \sqrt{\frac{\Sigma(Y^{2})}{N} - \left(\frac{\Sigma(Y)}{N}\right)^{2}} \]

S_y can be obtained from a transposition of the original formula 
for r from 

\[ r = \sqrt{1 - \frac{S_y^{2}}{\sigma_y^{2}}} \]

where r and σ_y are known from the above computation 

\[ r^{2} = 1 - \frac{S_y^{2}}{\sigma_y^{2}} \]

\[ S_y^{2} = \sigma_y^{2}(1 - r^{2}) \]

\[ S_y = \sigma_y\sqrt{1 - r^{2}} \]

Where the equation has its origin at the point of averages the 
line of regression may be determined from the formula¹ 

\[ b = r\,\frac{\sigma_y}{\sigma_x} \]

since y and x are in terms of deviation from their respective means. 
Because the line of regression must pass through the point of 
averages,² the a (y-intercept) value will be zero when that point is 
used as the origin. With the origin now at the point of averages 
the customary equation (here expressed in terms of deviations from 
the means) is resolved into 

\[ y = 0 + bx \]

Where a = 0 

and 

\[ b = r\,\frac{\sigma_y}{\sigma_x} \]

therefore 

\[ y = r\,\frac{\sigma_y}{\sigma_x}\,x \]

The more usual form of equation with an origin at zero (or in 
terms of actual values) may be determined by a transformation. 
Since 

\[ y = Y - \bar{Y} \]

\[ x = X - \bar{X} \]

\[ Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X}) \]

¹ For mathematical proof see technical appendix VI. 

² For proof see technical appendix VI. 
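For the problem above (see table 26): 

\[ p = \frac{\Sigma(XY)}{N} - \left(\frac{\Sigma(X)}{N}\right)\left(\frac{\Sigma(Y)}{N}\right) = \frac{322{,}227}{30} - \left(\frac{6468}{30}\right)\left(\frac{1252}{30}\right) = 1743.92 \]

\[ \sigma_x = \sqrt{\frac{\Sigma(X^{2})}{N} - \left(\frac{\Sigma(X)}{N}\right)^{2}} = \sqrt{\frac{1{,}725{,}088}{30} - \left(\frac{6468}{30}\right)^{2}} = 104.97 \]

\[ \sigma_y = \sqrt{\frac{\Sigma(Y^{2})}{N} - \left(\frac{\Sigma(Y)}{N}\right)^{2}} = \sqrt{\frac{60{,}650}{30} - \left(\frac{1252}{30}\right)^{2}} = 16.74 \]

\[ r = \frac{p}{\sigma_x \sigma_y} = \frac{1743.92}{(104.97)(16.74)} = .9925 \]

\[ S_y = \sigma_y\sqrt{1 - r^{2}} = 16.74\sqrt{1 - (.9925)^{2}} = 2.04 \]

\[ y = r\,\frac{\sigma_y}{\sigma_x}\,x = .9925\left(\frac{16.74}{104.97}\right)x \]

\[ y = .1582\,x \]

\[ Y - \bar{Y} = b(X - \bar{X}) \]

\[ Y - 41.73 = .1582\,(X - 215.6) \]

\[ Y = 7.63 + .1582X \]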
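
The product moment computation lends itself to a short Python sketch. The function name `product_moment` is hypothetical and the sums are the column totals of table 26. Because the sketch keeps full precision throughout, its results can differ slightly in the last place from the hand-computed figures above.

```python
import math


def product_moment(n, sum_x, sum_y, sum_xy, sum_xx, sum_yy):
    """Return r, the standard error of estimate, and the line (a, b)."""
    mean_x, mean_y = sum_x / n, sum_y / n
    p = sum_xy / n - mean_x * mean_y
    sigma_x = math.sqrt(sum_xx / n - mean_x ** 2)
    sigma_y = math.sqrt(sum_yy / n - mean_y ** 2)
    r = p / (sigma_x * sigma_y)
    s_y = sigma_y * math.sqrt(1 - r ** 2)
    # Slope from r, then the intercept from the point of averages.
    b = r * sigma_y / sigma_x
    a = mean_y - b * mean_x
    return r, s_y, a, b


# Column totals from table 26.
r, s_y, a, b = product_moment(30, 6468, 1252, 322227, 1725088, 60650)
print(round(r, 3), round(s_y, 2), round(a, 2), round(b, 4))
```

One design point worth noting: no regression values or individual deviations are needed, which is what makes this route less arduous than computing d for every item.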


Product Moment Method — Grouped Data 

The Correlation Table 

Where there are very many items in the series to be analyzed 
for association the procedure outlined above is unsatisfactory. 
The use of it invites error in proportion to the multiplicity of 
computations while the work of calculation is very great. 

Large groups of data may be handled more efficiently by first 
grouping them into the form of a double frequency distribution 
or correlation table. The procedure is given in table 27 for the 
association between the prices of wheat and flour. 

The items are located in the various boxes by reference to the 
class intervals in which their X and Y values fall. Thus an item 
with an X value of $.80 and a Y value of $.65 will be located in 
the box indicated by the X class interval $.80 - .89 and the Y 
class interval $.60 - .69. 

The calculations may be more readily carried out if, instead 
of trying to use the midpoints of the class intervals as the actual 
values, an arbitrary origin is selected for both X and Y values. 
Table 27 - Correlation Table - Wheat and Flour Prices 
by Months, 1914-1933 

X - Wheat Price per Bushel in Dollars 

The deviations from the arbitrary origin should be measured in 
terms of class intervals, as in the short method for calculating 
the mean and the standard deviation. 

The necessary values for the determination of r by the "product 
moment" method now may be readily secured. 

\[ r = \frac{p}{\sigma_x \sigma_y} \]

where 

\[ p = \frac{\Sigma(fd_x d_y)}{N} - \frac{\Sigma(fd_x)}{N} \cdot \frac{\Sigma(fd_y)}{N} \]

The sum of f d_x d_y is determined: 

1. By securing the cross product of the indicated value of 
d_x and d_y for each box (inserted in the lower left hand corner 
of the box). 

2. By multiplying the values obtained by the frequency 
contained within the box (result entered in upper right hand 
corner of box). 

3. By adding up for all boxes in the table. 

It is to be noted that all calculations for r are carried out in
terms of class intervals, not actual values, without reducing X
and Y to actual units for the computation. The application of
this technique for the series in table 27 is shown below.

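$$p = \frac{\Sigma(f d_x d_y)}{N} - \left(\frac{\Sigma(f d_x)}{N}\right)\left(\frac{\Sigma(f d_y)}{N}\right) = \frac{6335}{240} - \left(\frac{1119}{240}\right)\left(\frac{999}{240}\right)$$

$$= 26.3958 - (4.6625)(4.1625) = 6.9881$$

$$\sigma_x = \sqrt{\frac{\Sigma(f d_x^2)}{N} - \left(\frac{1119}{240}\right)^2} = \sqrt{30.0708 - 21.7389} = 2.8865$$

$$\sigma_y = \sqrt{\frac{5697}{240} - \left(\frac{999}{240}\right)^2} = \sqrt{23.7375 - 17.3264} = 2.5320$$

$$r = \frac{p}{\sigma_x \sigma_y} = \frac{6.9881}{(2.8865)(2.5320)} = .9561$$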

The standard error of estimate may now be computed from

$$S_y = \sigma_y\sqrt{1 - r^2}$$

by first converting σ_y to the original units of its series through
multiplying by the size of the Y class interval (in this case, $1.00).

$$S_y = \$2.5320\sqrt{1 - (.9561)^2} = \$.7419$$

The line of regression now may be obtained, using σ_x and σ_y in
original, not class interval, values:

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

$$y = 4.1935\,x$$

but since

$$y = Y - \bar{Y}$$

$$x = X - \bar{X}$$

$$Y - 7.6625 = 4.1935\,(X - 1.4325) \quad\text{and}\quad Y = 1.6553 + 4.1935\,X$$
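The grouped-data procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `grouped_r` and the small 3 x 3 table are assumptions made for the example and are not the wheat and flour data of table 27. Deviations are counted in class intervals from an arbitrary origin, taken here as the first row and first column.

```python
from math import sqrt

def grouped_r(table):
    """Return r for a correlation table (double frequency distribution).

    table[i][j] is the frequency of the box whose Y deviation is i and
    whose X deviation is j, both in class intervals from the arbitrary
    origin (first row, first column).
    """
    n = sum(sum(row) for row in table)
    sum_fdx = sum_fdy = sum_fdx2 = sum_fdy2 = sum_fdxdy = 0
    for dy, row in enumerate(table):
        for dx, f in enumerate(row):
            sum_fdx += f * dx
            sum_fdy += f * dy
            sum_fdx2 += f * dx * dx
            sum_fdy2 += f * dy * dy
            sum_fdxdy += f * dx * dy   # cross product times frequency
    p = sum_fdxdy / n - (sum_fdx / n) * (sum_fdy / n)
    sigma_x = sqrt(sum_fdx2 / n - (sum_fdx / n) ** 2)
    sigma_y = sqrt(sum_fdy2 / n - (sum_fdy / n) ** 2)
    return p / (sigma_x * sigma_y)

# Hypothetical table, for illustration only.
example = [[4, 1, 0],
           [1, 5, 1],
           [0, 2, 6]]
print(round(grouped_r(example), 4))
```

Because r is a pure ratio, the result is the same whether the deviations are left in class intervals or converted to original units; only the standard error and the regression line need the conversion.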

Coefficients of Determination and Alienation 

When the dependent variable (Y) is causally related to the
independent variable (X) and both series consist of simple
elements of equal variability, r² measures the variance¹ in Y
that is explained² by X. The measure (r²) is then termed the
coefficient of determination (the phrase index of determination is
used where the correlation is curvilinear).

Just as the coefficient of correlation is a relative measure 
of the degree of association between two series, the coefficient of 
alienation is a comparative measure of the lack of association. 


Coefficient of Alienation

The square of the coefficient of alienation may be interpreted
in a similar fashion to the square of the coefficient of correlation
and is known as the coefficient of non-determination.

$$1 = r^2 + k^2$$

and

$$k = \sqrt{1 - r^2}$$


Correction for Number of Cases 


When the number of cases is small, the coefficient of correlation
must be adjusted for exaggeration of its value and the standard
error must be adjusted for an underestimate.

The correction formulae are:

$$\bar{S}_y^2 = S_y^2\left(\frac{N}{N - 2}\right)$$

$$\bar{r}^2 = 1 - (1 - r^2)\left(\frac{N - 1}{N - 2}\right)$$

where S̄_y and r̄ are the corrected values.³


Other Correlation Methods 
Correlation from Ranks 

The method of measuring correlation from the ranks or position 
of the various items has proven particularly useful in education 
and psychology. 

In this method the data are numbered according to their 
position, as shown in table 28. 

The following formula (Spearman's)⁴ may then be applied.

$$\rho = 1 - \frac{6\Sigma(D^2)}{N(N^2 - 1)}$$

¹ Variance is the technical term for the square of the standard deviation.

² See Ezekiel, M., Methods of Correlation Analysis, page 375, note 4.

³ For more complete analysis of the formulae for S̄_y and r̄ see Ezekiel, M., Methods of
Correlation Analysis, p. 121.

⁴ See Kelley, T. L., Statistical Method, pp. 101-104, for derivation of this formula.

Table 28. Illustrating Rank Method of Measuring Association
Hypothetical Grades of Fifteen Students on Two Examinations

Student    Examination 1     Examination 2     Difference
Number     Grade    Rank     Grade    Rank     in Rank (D)    (D²)
   1       100%       1       90%       3          2            4
   2        98        2       96        1          1            1
   3        95        3       89        4          1            1
   4        91        4       87        5          1            1
   5        90        5       93        2          3            9
   6        85        6       86        6          0            0
   7        83        7       80        7          0            0
   8        82        8       79        8          0            0
   9        81        9       76       10          1            1
  10        80       10       77        9          1            1
  11        70       11       72       11          0            0
  12        65       12       60       14          2            4
  13        63       13       62       13          0            0
  14        60       14       50       15          1            1
  15        50       15       63       12          3            9
                                                  Total        32


where

ρ is the measure of correlation*
D is the difference between the two ranks given for each
individual
N equals number of individuals.

For the problem above

$$\rho = 1 - \frac{6(32)}{15(225 - 1)} = .943$$
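As a check on the arithmetic, the rank formula may be sketched in Python. The function name `spearman_rho` is an assumption made for this illustration; the two lists are the ranks of the fifteen students on the two examinations as read from table 28.

```python
def spearman_rho(ranks_1, ranks_2):
    """Spearman's rank correlation: 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Ranks of students 1 to 15 on each examination (table 28).
exam_1 = list(range(1, 16))
exam_2 = [3, 1, 4, 5, 2, 6, 7, 8, 10, 9, 11, 14, 13, 15, 12]

print(round(spearman_rho(exam_1, exam_2), 3))  # 0.943
```

With the sum of D² equal to 32 and N equal to 15, the fraction is 192/3360, so the coefficient works out to about .943.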


In case of ties in rank either one of two methods of assigning
ranks may be used. In the bracket method all are assigned the
same rank; but the next individual is given the rank that would
have been assigned if the ties had received successive ranks.

In the mid-rank method, a rank equal to that of the middle of
the tie is assigned to all items with identical values.


                            Ranks
Student    Grade    Bracket Method    Mid-Rank Method
   A       100%           1                 1
   B        95            2                 3
   C        95            2                 3
   D        95            2                 3
   E        94            5                 5
   F        92            6                 6.5
   G        92            6                 6.5
   H        90            8                 8
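The two ways of ranking tied items can be sketched as follows. The function names `bracket_ranks` and `mid_ranks` are assumptions for this illustration; the grades are those of students A to H in the table above, ranked from highest to lowest.

```python
def bracket_ranks(values):
    """Tied items all take the first rank of their group; the next
    item takes the rank it would have had without the tie."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def mid_ranks(values):
    """Tied items all take the rank of the middle of the tie."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1      # first rank occupied by the tie
        count = ordered.count(v)          # number of tied items
        ranks.append(first + (count - 1) / 2)
    return ranks

grades = [100, 95, 95, 95, 94, 92, 92, 90]
print(bracket_ranks(grades))  # [1, 2, 2, 2, 5, 6, 6, 8]
print(mid_ranks(grades))      # [1.0, 3.0, 3.0, 3.0, 5.0, 6.5, 6.5, 8.0]
```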


* The symbol ρ is the Greek small letter rho.

Assuming that the original data constitute a normal distribution,
the following relationship may be established between r (the
coefficient of correlation) and ρ.

$$r = 2\sin\left(\frac{\pi}{6}\,\rho\right)$$

Spearman's Footrule

When only a rough approximation of the correlation based on
ranks is desired, Spearman's "footrule" formula may be used:

$$R = 1 - \frac{6\Sigma G}{N^2 - 1}$$

where G represents the positive differences in rank.
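A short sketch of the footrule in Python follows. The function name `footrule` is an assumption, as is the convention of counting a difference as positive when the second rank exceeds the first; the ranks are again those read from table 28.

```python
def footrule(ranks_1, ranks_2):
    """Spearman's footrule: R = 1 - 6*sum(G) / (N^2 - 1),
    where G are the positive differences in rank (second minus first)."""
    n = len(ranks_1)
    gains = sum(b - a for a, b in zip(ranks_1, ranks_2) if b > a)
    return 1 - 6 * gains / (n * n - 1)

exam_1 = list(range(1, 16))
exam_2 = [3, 1, 4, 5, 2, 6, 7, 8, 10, 9, 11, 14, 13, 15, 12]

print(round(footrule(exam_1, exam_2), 3))  # 0.786
```

The positive differences sum to 8, so R = 1 - 48/224, or about .786, a rougher and lower figure than the .943 given by the squared-difference formula for the same ranks.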


Correlation and the Time Series

When two time series are correlated, if there is a similar upward 
trend the higher items toward the end of the first series will be 
associated with the higher items in the second. Due to the 
influence of trend there will be an exaggeration of the correlation. 
In either case the long-time relationship would tend to overshadow 
the short time movements upon which attention particularly 
centers. 

The following methods may be used to overcome the difficulty: 

1. The deviations from trend may be correlated. 

2. The first differences (deviation of each item from previous 
item in series) may be correlated. 

3. The two series may be adjusted for trend. 
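The effect of trend, and the second remedy listed above (correlating first differences), can be illustrated with a small sketch. Both series are hypothetical and the helper names `pearson_r` and `first_differences` are assumptions made for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product moment coefficient of correlation for two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def first_differences(series):
    """Deviation of each item from the previous item in the series."""
    return [b - a for a, b in zip(series, series[1:])]

# Two hypothetical series sharing an upward trend but moving
# in opposite directions from one period to the next.
series_a = [10, 13, 14, 18, 19, 23, 24, 28]
series_b = [5, 6, 9, 10, 13, 14, 17, 18]

print(round(pearson_r(series_a, series_b), 3))
print(round(pearson_r(first_differences(series_a),
                      first_differences(series_b)), 3))
```

In this constructed case the raw series show a high positive correlation because of the common trend, while the first differences are negatively correlated, which is the short-time relationship the trend had concealed.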

Types of Correlation

I. Simple Correlation:

Two variables; one independent and one dependent. 

A. Linear Correlation: The change in one variable is at a 
constant ratio to change in the other. 

1. Direct: An increase in one variable is accompanied by 
an increase in the other. 

2. Inverse: An increase in one variable is accompanied by 
a decrease in the other variable. 

B. Non-Linear (Curvilinear) Correlation: The change in one 
variable is at an increasing or decreasing ratio, not at a fixed 
ratio, to a change in the other variable. 

II. Multiple Correlation:

There are more than two variables. One variable is dependent
while the other variables are independent.
Multiple correlation may be:

A. Linear: Some variables may be associated directly and others
inversely.

B. Non-Linear

C. Joint: The relationship between various independent and
dependent variables changes with any change in another
independent variable.

III. Partial Correlation:

Partial Correlation measures the association between an
independent and a dependent variable. It allows for the variation
associated with specified other independent variables.

CORRELATION: NON-LINEAR, MULTIPLE, PARTIAL

A straight line of regression does not always satisfactorily 
describe the association between variables. Frequently the 
relationship is too complex to be described by means of a simple 
straight line and therefore a curve must be used. 

For instance, if the association between rainfall and crop yield 
is examined it is found that beyond a certain point a doubling of 
the amount of rainfall will not result in a doubling of the crop 
yield. On the contrary, an approximate point could be established 
beyond which the yield would decrease in like proportion up to 
complete extinction of the crop. 


Types of Regression Curves 

The types of curves which may be used are similar to those
described in the chapter on trend (see chapter VII).

The two most important types of curves are:

1. Potential curves of the type

$$Y = a + bX + cX^2 + dX^3 + \ldots \text{ etc.}$$

2. Exponential curves of the type

$$Y = ab^X$$

In terms of logarithms exponential curves may be divided into

a. Logarithmic

$$\log Y = a + b \log X$$

$$\log Y = a + b \log X + c\,(\log X)^2$$

b. Semi-logarithmic

$$\log Y = a + bX$$

$$\log Y = a + bX + cX^2$$

or

$$Y = a + b \log X$$

$$Y = a + b \log X + c\,(\log X)^2$$

Figure 23 illustrates the appearance of some of these curves.

Fig. 23. Curves Illustrating a Number of Different Types
of Mathematical Functions.
From Ezekiel, Mordecai, Methods of Correlation Analysis, p. 69.

The normal equations for any of the curves may be derived by the
same method as outlined for trend curves (Chapter VI).

For a second degree parabola of the potential group

$$Y = a + bX + cX^2$$

the "normal" equations are

$$\text{(I)}\quad \Sigma(Y) = Na + b\Sigma(X) + c\Sigma(X^2)$$

$$\text{(II)}\quad \Sigma(XY) = a\Sigma(X) + b\Sigma(X^2) + c\Sigma(X^3)$$

$$\text{(III)}\quad \Sigma(X^2 Y) = a\Sigma(X^2) + b\Sigma(X^3) + c\Sigma(X^4)$$

For a semi-logarithmic equation of the type

$$\log Y = a + bX + cX^2$$

the normal equations are

$$\text{(I)}\quad \Sigma(\log Y) = Na + b\Sigma(X) + c\Sigma(X^2)$$

$$\text{(II)}\quad \Sigma(X \log Y) = a\Sigma(X) + b\Sigma(X^2) + c\Sigma(X^3)$$

$$\text{(III)}\quad \Sigma(X^2 \log Y) = a\Sigma(X^2) + b\Sigma(X^3) + c\Sigma(X^4)$$

Before fitting an equation of the type

$$Y = \frac{1}{a + bX}$$

the equation is first converted to

$$\frac{1}{Y} = a + bX$$

or

$$Y' = a + bX$$

where Y' is the reciprocal of Y.

The fitting of the curve may now be carried out in the usual
fashion by obtaining the normal equations.

$$\text{(I)}\quad \Sigma(Y') = Na + b\Sigma(X)$$

$$\text{(II)}\quad \Sigma(XY') = a\Sigma(X) + b\Sigma(X^2)$$
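The fitting of a second degree parabola from its three normal equations may be sketched as below. The helper `solve` (ordinary Gaussian elimination) and the function name `fit_parabola` are assumptions made for this illustration, and the data are hypothetical values lying exactly on Y = 2 + 3X + .5X².

```python
def solve(matrix, rhs):
    """Solve a small set of simultaneous linear equations
    by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x

def fit_parabola(xs, ys):
    """Return a, b, c of Y = a + bX + cX^2 from the normal equations."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    return solve([[n, sx, sx2],      # equation (I)
                  [sx, sx2, sx3],    # equation (II)
                  [sx2, sx3, sx4]],  # equation (III)
                 [sy, sxy, sx2y])

xs = [0, 1, 2, 3, 4]
ys = [2 + 3 * x + 0.5 * x * x for x in xs]
a, b, c = fit_parabola(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

For the semi-logarithmic and reciprocal types the same routine would apply after replacing Y by log Y or by its reciprocal, as described above.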


Non-Linear Standard Error of Estimate 


The standard error of estimate about the curve may be determined
as before for the straight line (see page 76)
from the formula:

$$S_y = \sqrt{\frac{\Sigma(d^2)}{N}}$$

where d is the deviation of each actual value of Y from the curve,

or from¹

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2 Y) - d\Sigma(X^3 Y) - \text{etc.}}{N}$$

For types of curves other than the potential series this formula
may be adapted by using the log of X or Y or their reciprocals as
called for by the type equation. Thus for a curve of the type

$$\log Y = a + bX + cX^2$$

the formula would read

$$S_{\log Y}^2 = \frac{\Sigma(\log Y)^2 - a\Sigma(\log Y) - b\Sigma(X \log Y) - c\Sigma(X^2 \log Y)}{N}$$

¹ The derivation of this formula follows along the same lines as that for the linear standard
error outlined in technical appendix V.

The Index of Correlation: rho (ρ)

When computed about a curve the comparative measure of
correlation is known as the index of correlation. It is assigned the
Greek letter "rho", or ρ, as a symbol.

$$\rho = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

The measure ρ is equal in value to r, the coefficient of correlation,
when the regression is definitely linear. If the regression is
non-linear, a curve will more closely approximate the data. As a
result the deviations about the curve will tend to be smaller than
those about the straight line. The standard error, therefore,
will be smaller and will result in a larger value for rho. Rho is
always either equal to or larger than r. Rho may be computed from
the following formula:¹

$$\rho^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2 Y) + \ldots - N c_y^2}{\Sigma(Y^2) - N c_y^2}$$

The index of alienation and the indices of determination and
non-determination are computed in the same manner and with the
same meaning as for the linear relationships (see page 84).

It is important to remember that the value for a given index
of correlation (rho) may be compared to that for another
association only when the same type of curve is used to describe both
regressions.

At best non-linear regression lines must be used with extreme
care, since the more complex the line the higher the index of
correlation. An ultimate value of 100% may be reached where a
curve so complex as to pass through all of the points is used. The
resulting index would then be meaningless.


Correlation Ratio 

The correlation ratio is only rarely used. In this measure the
curve of regression is one which passes through the means of all
the columns when the scatter diagram is divided into columns.

The regression is not defined by an equation.

Since the data must be divided into groups, the correlation ratio
can be computed from the following formula:*

$$\eta = \sqrt{1 - \frac{\sigma_{ay}^2}{\sigma_y^2}}$$

where σ_ay is the standard deviation of the various values about the
means of their respective columns.

The value of η is dependent upon the number of columns in the
correlation table as compared to the number of cases used. With
enough columns in the correlation table so that there is but one
item in each column the correlation as computed by the correlation
ratio would be perfect. A correction for fineness of grouping can
be made by use of the formula:

$$\text{Corrected } \eta^2 = \frac{\eta^2 - \dfrac{\kappa - 3}{N}}{1 - \dfrac{\kappa - 3}{N}}$$

where

κ is the number of arrays or columns in the correlation table.
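The uncorrected correlation ratio can be sketched directly from its definition. The function name `correlation_ratio` and the small grouped data set are assumptions made for the illustration; each inner list holds the Y values falling in one column of the scatter diagram.

```python
from math import sqrt

def correlation_ratio(columns):
    """eta = sqrt(1 - variance about column means / total variance of Y)."""
    all_y = [y for col in columns for y in col]
    n = len(all_y)
    mean_y = sum(all_y) / n
    total_var = sum((y - mean_y) ** 2 for y in all_y) / n
    within_var = sum((y - sum(col) / len(col)) ** 2
                     for col in columns for y in col) / n
    return sqrt(1 - within_var / total_var)

# Hypothetical Y values grouped into three columns.
example = [[1, 3], [4, 6], [2, 4]]
print(round(correlation_ratio(example), 4))       # 0.7802

# One item per column: the ratio is necessarily perfect.
print(correlation_ratio([[1], [5], [2]]))         # 1.0
```

The second call shows why the fineness of grouping matters: with a single item in each column every value coincides with its column mean, and the ratio reaches 1 regardless of the data.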


1 See note J on page Hi 

* The symbol η is the Greek small letter eta.

Since a curve passing through the means of the columns will
be most descriptive of the data, the deviations about this curve
will be smallest. The correlation ratio, therefore, will be either
larger than or equal to the index of correlation or the coefficient of
correlation. If the relationship is essentially linear the means will
all coincide with a straight line and eta and r will be equal.

Since rho will also equal r if the regression is definitely linear:

$$\eta = \rho = r$$

but if it is non-linear:

$$\eta > \rho > r$$

Thus where the regression is basically non-linear, η would be
larger than rho. A test for linearity of regression has been
devised on this basis.

$$\zeta = \eta^2 - r^2$$

where ζ (zeta)* is the test for linearity.

When ζ = 0, the regression is linear; when ζ > 0 a non-linear
regression is called for.

Method of Successive Elimination 

Although in scientific experimentation it is possible to control
all of the different variables and allow only the factor being studied
to vary, this is not possible in many fields, especially in the social
sciences and business, where numerous uncontrollable factors vary
simultaneously. The relation of one of these numerous factors
to a studied dependent variable, to the exclusion of the other
variables, may be studied by means of the method of successive
elimination.

The effectiveness of advertising as gauged by the number of
returns secured to a keyed advertisement is dependent upon the
circulation of the newspaper in which it was inserted and the size
of the advertisement,¹ among other factors. If it is desired to
study the effectiveness of various sizes of advertisements, allowing
for, or more exactly excluding, the effect of the differing circulation
of the various papers in which the advertisements appeared, the
relationship between circulation and returns is studied by means
of either linear or non-linear correlation. The line of regression is
determined by the usual methods. Since this variable is not the
only factor determining the number of returns the points (actual
values) will be scattered about the line of regression rather than
coincide with it.

The corresponding theoretical values due to varying circulations
may now be determined from the line of regression and the
differences secured between the actual number of returns and these
theoretical values. These values are termed the residuals² (the
actual value less the theoretical value):

$$d' = Y - Y_c$$

* The symbol ζ is the Greek small letter zeta.

¹ Assuming that the advertising copy is the same for all advertisements.

² Some of these values will be negative, showing below average returns for a paper of the
specified circulation.

The residuals may now be correlated with the size of the
advertisement to determine their relationship. The line of regression will
express the variation in the returns if the circulation of the paper
were held constant.

Another effective method is to use the line of regression to adjust 
the given data so that the returns are expressed as though all were 
obtained from a paper of a given circulation (as 100,000). An 
allowance is made for papers with smaller circulations by means of 
the regression line and an adjustment is made for larger circulations. 
The b value (in a linear regression) expresses the increase in returns 
per unit of circulation and can be used for this purpose. 
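The method of successive elimination may be sketched in Python for the advertising illustration. All figures are hypothetical, and the helper names `line_fit` and `pearson_r` are assumptions made for the example; a straight line is assumed for the relation between circulation and returns.

```python
from math import sqrt

def line_fit(xs, ys):
    """Least squares straight line Y = a + bX; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def pearson_r(xs, ys):
    """Product moment coefficient of correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for six newspapers.
circulation = [20, 40, 60, 80, 100, 120]   # thousands of copies
ad_size = [2, 4, 2, 4, 2, 4]               # size of advertisement
returns = [50, 100, 130, 180, 210, 260]    # keyed returns

# Step 1: regression of returns on circulation.
a, b = line_fit(circulation, returns)

# Step 2: residuals = actual returns less theoretical returns.
residuals = [y - (a + b * x) for x, y in zip(circulation, returns)]

# Step 3: correlate the residuals with the size of the advertisement.
print(round(pearson_r(ad_size, residuals), 3))
```

The residuals sum to zero, some being negative as noted above, and their correlation with size measures the association remaining after circulation has been allowed for.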

Multiple Correlation

The fluctuations in a given series are seldom dependent upon
a single factor or cause. The measurement of the association
between such a series and several of the variables causing these
fluctuations or associated with the dependent variable is known as
multiple correlation.

Multiple correlation consists of the measurement of the
relationship or association between a dependent variable and two or more
independent variables. The procedure is similar to that for simple
correlation with the exception that other variables are added to
the regression equation. For two independent variables the
regression equation, if linear, is of the type:

$$X_1 = a + b_{12.3} X_2 + b_{13.2} X_3$$

In the equation X_1 is the dependent or estimated variable
(replacing the symbol Y previously used) and X_2 and X_3 are the
independent variables.

The coefficient of regression or slope, b_12.3, indicates the number
of units change in the dependent variable for a given unit change
in X_2, while b_13.2 indicates the change in X_1 for a unit change in X_3.
However, in the computation of these coefficients of regression the
associations of each of the other independent variables with the
dependent variable were taken into consideration. The
coefficients, therefore, indicate the net relationship between the
dependent and an independent variable, allowing for the other
factors or variables which are also considered in the equation.
The subscripts after the period indicate the other variables
included. Thus b_12.345 would give the net regression of variable X_1
in relationship to X_2, allowing for X_3, X_4, and X_5. The last
named coefficients are therefore known as the coefficients of net
regression.

The values for the coefficients may be obtained in the usual
manner by obtaining the "normal" equations. For a linear relationship
with three independent variables:

$$X_1 = a + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4$$

the normal equations would be

$$\text{(I)}\quad \Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4)$$

$$\text{(II)}\quad \Sigma(X_1 X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2 X_3) + b_{14.23}\Sigma(X_2 X_4)$$

$$\text{(III)}\quad \Sigma(X_1 X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2 X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3 X_4)$$

$$\text{(IV)}\quad \Sigma(X_1 X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2 X_4) + b_{13.24}\Sigma(X_3 X_4) + b_{14.23}\Sigma(X_4^2)$$


By assuming the origin of the equation to be at the point of
averages these equations reduce to:¹

$$\text{(I)}\quad p_{12} = b_{12.34}\,\sigma_2^2 + b_{13.24}\,p_{23} + b_{14.23}\,p_{24}$$

$$\text{(II)}\quad p_{13} = b_{12.34}\,p_{23} + b_{13.24}\,\sigma_3^2 + b_{14.23}\,p_{34}$$

$$\text{(III)}\quad p_{14} = b_{12.34}\,p_{24} + b_{13.24}\,p_{34} + b_{14.23}\,\sigma_4^2$$

The equations may now be solved simultaneously to arrive at
the desired values for b_12.34, b_13.24, and b_14.23.

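The standard error of estimate may now be computed from

$$S_{1.234} = \sqrt{\frac{\Sigma(d^2)}{N}}$$

where d is the difference between each actual and estimated value
of X_1, or more readily from²

$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}$$

and the coefficient of multiple correlation from

$$R_{1.234}^2 = 1 - \frac{S_{1.234}^2}{\sigma_1^2}$$

or

$$R_{1.234}^2 = \frac{b_{12.34}\,p_{12} + b_{13.24}\,p_{13} + b_{14.23}\,p_{14}}{\sigma_1^2}$$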


The coefficients of multiple alienation, determination, and
non-determination may now be computed and applied as in simple
correlation.

The same technique may be used for non-linear multiple
correlation, using the general equation:

$$X_1 = a + f(X_2) + f(X_3) + f(X_4) + \text{etc.}$$

where "f(X_2)" indicates any function of X_2 such as a parabola,
etc. As the complexity of the function increases the amount of
computation required soon becomes so great as to render the use of
the technique prohibitive.³
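A sketch of linear multiple correlation follows, kept to two independent variables so that the reduced normal equations can be solved by hand. The function names `moment` and `multiple_regression` are assumptions, as are the data, which were constructed to lie exactly on X_1 = 1 + 2X_2 + 3X_3 so that the result can be checked by eye.

```python
from math import sqrt

def moment(us, vs):
    """Product moment about the means: sum of (u - mean u)(v - mean v) / N."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / n

def multiple_regression(x1, x2, x3):
    """Solve   p12 = b12.3 * var2 + b13.2 * p23
               p13 = b12.3 * p23  + b13.2 * var3
    and return (a, b12.3, b13.2, R)."""
    p12, p13, p23 = moment(x1, x2), moment(x1, x3), moment(x2, x3)
    var1, var2, var3 = moment(x1, x1), moment(x2, x2), moment(x3, x3)
    det = var2 * var3 - p23 ** 2
    b12_3 = (p12 * var3 - p13 * p23) / det
    b13_2 = (p13 * var2 - p12 * p23) / det
    n = len(x1)
    # "a" from the first normal equation.
    a = (sum(x1) - b12_3 * sum(x2) - b13_2 * sum(x3)) / n
    r_multiple = sqrt((b12_3 * p12 + b13_2 * p13) / var1)
    return a, b12_3, b13_2, r_multiple

# Hypothetical data constructed so that X1 = 1 + 2*X2 + 3*X3 exactly.
x2 = [1, 2, 3, 4, 5, 6]
x3 = [2, 1, 4, 3, 6, 5]
x1 = [1 + 2 * u + 3 * v for u, v in zip(x2, x3)]

a, b12_3, b13_2, r_multiple = multiple_regression(x1, x2, x3)
print(round(a, 6), round(b12_3, 6), round(b13_2, 6), round(r_multiple, 6))
```

With three or more independent variables the same reduced equations apply, but a general routine for simultaneous equations would replace the two-by-two solution used here.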

¹ See technical appendix VII. "a" may be obtained from the first "normal" equation
Σ(X_1) = Na + b_12.34 Σ(X_2) + b_13.24 Σ(X_3) + b_14.23 Σ(X_4)

² See technical appendix VIII.

³ A less arduous technique for curvilinear multiple correlation is outlined in Ezekiel, M.,
Methods of Correlation Analysis.

Partial Correlation

Coefficient of Partial Correlation

When it is desired to compute the separate or net effect or
importance of each independent variable the technique of partial
correlation may be used.

The coefficient of partial correlation is a relative measure of
the association between the dependent variable and a given
independent variable, eliminating the effect of the other
independent variables.

There are a number of formulae which may be used to compute
the association.

$$1.\quad r_{13.24} = \sqrt{b_{13.24} \times b_{31.24}}$$

$$2.\quad r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

$$3.\quad r_{14.23}^2 = 1 - \frac{1 - R_{1.234}^2}{1 - R_{1.23}^2}$$
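The second formula, which needs only the three simple coefficients, is easily sketched. The function name `partial_r` and the coefficients used are assumptions made for the illustration.

```python
from math import sqrt

def partial_r(r12, r13, r23):
    """Partial correlation of variables 1 and 2, eliminating variable 3."""
    return (r12 - r13 * r23) / (sqrt(1 - r13 ** 2) * sqrt(1 - r23 ** 2))

# Hypothetical simple coefficients of correlation.
print(round(partial_r(0.5, 0.5, 0.5), 4))  # 0.3333
print(round(partial_r(0.6, 0.0, 0.0), 4))  # 0.6
```

In the first case part of the simple correlation of .5 between variables 1 and 2 is accounted for by their common association with variable 3; in the second, variable 3 is unrelated to either and the partial coefficient equals the simple one.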

Coefficient of Part Correlation

A similar coefficient somewhat easier to compute is called the
coefficient of part correlation (₁₂r₃₄).

The differences between these two coefficients may be shown
by comparing the coefficient of partial correlation to that resulting
from the correlation of:¹

(X_2 - b_23.4 X_3 - b_24.3 X_4) with (X_1 - b_13.4 X_3 - b_14.3 X_4)

while the coefficient of part correlation may be compared to the
correlation between

X_2 and (X_1 - b_13.24 X_3 - b_14.23 X_4)

The coefficient of part correlation may be computed from

$${}_{12}r_{34}^2 = \frac{b_{12.34}^2\,\sigma_2^2}{b_{12.34}^2\,\sigma_2^2 + \sigma_1^2\,(1 - R_{1.234}^2)}$$

The subscripts to the right of the letter indicate the variables
excluded.

Beta Coefficients

The relative importance of the individual independent variables
in a multiple correlation in determining the dependent variable
may be determined through resort to the beta coefficients.

The coefficients of regression of the multiple correlation
regression equation indicate the increase in the dependent variable
resulting from a unit increase in the indicated independent
variable. However, the various independent variables are often
expressed in different units, making a direct comparison of the
coefficients impossible. The coefficients of multiple regression may
be made comparable by dividing each variable by its own standard
deviation. Thus the original multiple regression equation

$$X_1 = a + b_{12.3} X_2 + b_{13.2} X_3$$

becomes

$$\frac{X_1}{\sigma_1} = a' + \beta_{12.3}\,\frac{X_2}{\sigma_2} + \beta_{13.2}\,\frac{X_3}{\sigma_3}$$

and therefore

$$\beta_{12.3} = b_{12.3}\,\frac{\sigma_2}{\sigma_1}$$

$$\beta_{13.2} = b_{13.2}\,\frac{\sigma_3}{\sigma_1}$$

The beta coefficients are comparable measures and indicate the
increase in the dependent variable resulting from an increase of
one standard deviation in each independent variable.

¹ Suggested by Ezekiel in Methods of Correlation Analysis, page 183.


ADDITIONAL BIBLIOGRAPHY*

Bowley, Arthur L., Elements of Statistics, pp. 350-408. P. S.
King & Son, London, 1907.

Fisher, R. A., Statistical Methods for Research Workers, pp.
152-223. Oliver & Boyd, Edinburgh, 1932.

Kelley, Truman L., Interpretation of Educational Measurements,
pp. 180. World Book Co., Yonkers, New York, & Chicago,
Illinois, 1927.

Odell, C. W., Educational Statistics, pp. 200-213; 215-279.
Century Co., New York, 1925.

Otis, Arthur S., Statistical Method in Educational Measurement,
pp. 200-215. World Book Co., Yonkers, New York, &
Chicago, Illinois, 1920.

Rietz, H. L. (Editor), Handbook of Mathematical Statistics,
pp. 129-149. Houghton Mifflin Co., New York, 1924.

Sutcliffe, William G., Statistics for the Business Man,
pp. 224-228. Harper & Bros., New York, 1930.

* For readings in standard Statistics textbooks, see the QUICK REFERENCE TABLE TO
STANDARD TEXTBOOKS following Table of Contents.



CHAPTER XI 


CORRELATION OF ATTRIBUTES 

The previous section dealt with the measurement of the
association between two measured characteristics of a given set of
items. Two quantitative values were determined for each item
and a coefficient of correlation was computed for the various
paired values. Thus if the height and weight of a group of
individuals are measured the coefficient of correlation can be
computed. It will show the degree to which the greater heights are
associated with the greater weights.

It is not always possible to use measurements for various
characteristics. For instance, many classifications are qualitative
such as light and dark, good and poor, etc. The association
between heights and weights may be determined by classifying
the individuals as light and heavy and tall and short. The
association between the two characteristics may be determined by
cross classifying each individual in a fourfold table of the type
shown below where a, b, c, and d represent the number of cases
with each of the given pairs of characteristics.



            Short    Tall    Total
Light         a        b
Heavy         c        d
Total                          N


If the association is perfect, that is, if all tall people are heavy
and all short people are light, all of the cases will be located in
two boxes (cells), in this case a and d. If there is absolutely no
association, that is, if it is a matter of indifference insofar as weight
is concerned if a person is tall or short, the cases will be distributed
at random throughout the four boxes. Since there is an equal
likelihood of a case appearing in any box there will tend to be an
equal number in each box.

In a similar manner the qualitative characteristics of a group
of individuals or items may be arranged in a table when more
than two alternative attributes are dealt with. A table of this
type is shown below:

          A       B       C       D       Total
A'                                        n_r1
B'                n_rc                    n_r2
C'                                        n_r3
D'                                        n_r4
Total     n_c1    n_c2    n_c3    n_c4    N

where A, B, C, D and A', B', C', D' are two sets of qualitative
specifications.¹

In this type of table if the association is perfect the cases will
all fall in a diagonal row of cells if the table is symmetrical, for
quality A will always accompany quality A', B will accompany
quality B', etc. If there is no correlation there will be a tendency
towards an equal distribution of cases among various boxes.

It is possible on the basis of these facts to compute a coefficient 
of association which will serve as a comparative measure of 
correlation. 

Coefficient of Contingency

This coefficient of association (the coefficient of mean-square
contingency) is based upon a comparison of the number of cases
actually occurring in a given cell or box and the number of cases
which would occur in the cell due to chance, or a comparison of
the actual distribution and the distribution occurring when there
is no association.

If n_r is the number of cases in a given row, n_c is the number
in a given column and n_rc is the number in a given cell

$$n_{rc} - \frac{n_r n_c}{N}$$

will be the difference between actual number of cases and the
number of cases occurring due to chance. But the ratio of the
square of this value to the theoretical number of cases is χ²
(test for goodness of fit)² or

$$\chi^2 = \Sigma\left[\frac{\left(n_{rc} - \dfrac{n_r n_c}{N}\right)^2}{\dfrac{n_r n_c}{N}}\right]$$

Pearson's mean-square contingency, φ²,³ is obtained by dividing
this value by N

$$\phi^2 = \frac{\chi^2}{N}$$

¹ The qualitative specifications should be arranged in order where there are more than two,
such as poor, fair, good, etc.

² See page 109.

³ φ is the Greek letter phi.

and the coefficient of mean-square contingency is

$$C = \sqrt{\frac{\phi^2}{1 + \phi^2}}$$

A simpler form of this formula as shown by Yule is

$$C = \sqrt{\frac{S - 1}{S}}$$

where

$$S = \Sigma\left(\frac{n_{rc}^2}{n_r n_c}\right)$$

Steps in Computation

1. Square the value in each cell (n_rc²).

2. For each box multiply the number in that row by the number
in that column (n_r n_c).

3. Divide the squared value in each box (n_rc²) by the
corresponding n_r n_c.

4. Sum up for all boxes. The resulting value is S.

5. Substitute in¹

$$C = \sqrt{\frac{S - 1}{S}}$$
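The two routes to the coefficient of contingency can be compared in a short sketch. The function name `contingency` and the tables are assumptions for the illustration; the function returns C computed first from χ² and then from Yule's S, which should agree.

```python
from math import sqrt

def contingency(table):
    """Return (C from chi-square, C from Yule's S) for a table of frequencies."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    s = 0.0
    for i, row in enumerate(table):
        for j, n_rc in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # chance frequency
            chi2 += (n_rc - expected) ** 2 / expected
            s += n_rc ** 2 / (row_totals[i] * col_totals[j])
    phi2 = chi2 / n
    c_from_chi2 = sqrt(phi2 / (1 + phi2))
    c_from_s = sqrt(max(s - 1, 0.0) / s)   # guard against rounding below zero
    return c_from_chi2, c_from_s

# Hypothetical 3 x 3 table.
print(contingency([[20, 10, 5], [10, 20, 10], [5, 10, 20]]))

# Perfect association in a 2 x 2 table: C reaches its ceiling of .707.
print(contingency([[10, 0], [0, 10]]))
```

The second call illustrates that C cannot reach 1.00 in a small table even when the association is perfect, so values of C are comparable only between tables of the same size.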

Fourfold Tables (2 x 2 classifications)

When two variates are classed in alternate categories, A and
not A, B and not B,

1. Yule's coefficient of association and coefficient of colligation²

The coefficient of association is

$$Q = \frac{ad - bc}{ad + bc}$$

The coefficient of colligation is³

$$\omega = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$$

where a, b, c, d represent the frequencies contained within the
various cells as follows:

    a    b
    c    d

¹ It has been shown that the number of classifications in the table will affect the largest
possible value of C. When a table of 2 x 2 classifications is used C cannot exceed .707, when
4 x 4 the maximum value is .866, when 5 x 5 it cannot exceed .894, when 10 x 10 the
maximum value is .949.

² Objections have been offered to the use of both of these measures.

³ ω is the Greek letter omega.







When the relationship is perfect all of the cases will be
concentrated in two of the boxes, either a and d or b and c, and
therefore both Q and ω will be equal to 1.00 or -1.00. When there
is no association the distribution of cases will be equal
(a = b = c = d) and therefore both Q and ω will equal zero.

2. Pearson's cosine method.

The coefficient of correlation by the cosine method for a
fourfold table is

$$r = \cos\left(\frac{\sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}\,\pi\right)$$

This coefficient will vary in size from r = 0 to r = 1.00. When the
association is perfect and direct there will be frequencies in only
two of the squares (a and d) and therefore the fraction

$$\frac{\sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$$

will equal zero and r will equal 1.00; when the frequencies fall
only in b and c the fraction will equal 1 and r will equal -1.00.
When there is an equal distribution of cases, a = b = c = d, the
fraction will equal .50 and r = .00.
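The three fourfold measures can be set side by side in a brief sketch. The function names are assumptions for the illustration and the cell frequencies are hypothetical; a, b, c, d follow the layout given for the fourfold table.

```python
from math import sqrt, cos, pi

def yule_q(a, b, c, d):
    """Yule's coefficient of association."""
    return (a * d - b * c) / (a * d + b * c)

def colligation(a, b, c, d):
    """Yule's coefficient of colligation (omega)."""
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

def cosine_r(a, b, c, d):
    """Pearson's cosine method for a fourfold table."""
    return cos(pi * sqrt(b * c) / (sqrt(a * d) + sqrt(b * c)))

for cells in [(10, 0, 0, 10), (30, 10, 10, 30), (5, 5, 5, 5)]:
    print(cells,
          round(yule_q(*cells), 4),
          round(colligation(*cells), 4),
          round(cosine_r(*cells), 4))
```

For the middle table the three measures give .8, .5 and about .707 respectively, which shows that the coefficients are not interchangeable even though all three equal 1.00 for perfect association and zero for an equal distribution.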


Biserial Coefficient of Correlation

When the table is of 2 x N classifications, i.e., when there are
only two possible categories of one attribute and a number of
classifications for the other attribute, the biserial coefficient of
correlation may be computed. This type of classification is
common since classification by sex and other similar two-fold categories
appear frequently. It assumes that the distribution of the
attribute is approximately normal.

The formula is

$$\text{biserial } r = \frac{(\bar{X}_p - \bar{X}_q)\,pq}{\sigma \times .3989\,h}$$

where

X̄_p = the mean value of the p category*
X̄_q = the mean value of the q category
p = percent of cases in p category
q = percent of cases in q category
σ = standard deviation of combined categories (p and q)
h = height of ordinate of normal curve at a distance from the
mean including (p - q)/2 of the area of the curve.

The value h is computed by determining the number of standard
deviations from the mean including (p - q)/2 of the area of the
normal curve (from the table of the area of the normal curve, page 110)
and using that number of standard deviations in the table of the
ordinates of the normal curve to determine the height of the
curve (in percent of the maximum ordinate) at that point.

* The various categories of the attribute are assigned quantitative values 1, 2, 3, etc., if not
in quantitative form.
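In place of the printed tables, the ordinate may be taken from Python's standard library. The sketch below is an illustration under stated assumptions: the function name `biserial_r` and the two small groups of values are hypothetical, p and q are expressed as proportions, and the ordinate returned by `NormalDist().pdf` corresponds to .3989 h when h is read as a fraction of the maximum ordinate.

```python
from statistics import NormalDist, mean, pstdev

def biserial_r(group_p, group_q):
    """Biserial r from the values of the measured attribute in each category."""
    n = len(group_p) + len(group_q)
    p = len(group_p) / n
    q = len(group_q) / n
    sigma = pstdev(group_p + group_q)       # combined categories
    z = NormalDist().inv_cdf(p)             # point dividing the curve into p and q
    ordinate = NormalDist().pdf(z)          # equals .3989 * h
    return (mean(group_p) - mean(group_q)) * p * q / (sigma * ordinate)

# Hypothetical values for the two categories.
group_p = [5, 6, 7, 8, 9]
group_q = [2, 4, 5, 6, 8]
print(round(biserial_r(group_p, group_q), 4))  # 0.6267
```

Here the two categories are of equal size, so the dividing point falls at the mean and the ordinate is at its maximum of .3989.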

ADDITIONAL BIBLIOGRAPHY*

Rietz, H. L. (Editor), Handbook of Mathematical Statistics,
Houghton Mifflin Co., New York, 1924.

* For readings in standard Statistics textbooks, see the QUICK REFERENCE TABLE TO
STANDARD TEXTBOOKS following Table of Contents.



CHAPTER XII 
THE NORMAL CURVE 


When events must occur in one of two ways, and there are 
many such events, they tend to be equally divided into two groups; 
the first consisting of favorable (or desired) occurrences, the second 
of unfavorable (or non-desired) occurrences. The desired result may 
be determined empirically by counting and recording the results. 

The chances are even that whenever a coin is flipped a "head"
will appear. Careful analysis of this somewhat obvious statement
will lead to a better understanding of the nature of probability
and in turn of the theory of sampling. Since only a "head"
or a "tail" can appear, the probability of success in the appearance
of the desired face can be stated as one half.

$$p = \frac{a}{N} = \frac{1}{2}$$

where:

a = number of ways in which a favorable outcome can appear.
N = total number of possible events.
p = probability of success.

In general if an event can happen in a ways and not happen in
b ways, the probability of its occurrence is

$$p = \frac{a}{a + b} = \frac{a}{N}$$

where

a + b = N

The probability of failure is therefore:

$$q = \frac{b}{a + b} = \frac{b}{N}$$

where:

b = the possible number of unfavorable results.
q = probability of failure.

If a coin is tossed one hundred times for "heads" or "tails"
the probability of either result for each toss is 1/2. This ratio
of probability may be expressed for the group as (1/2)(100).

The probability of occurrence of one or another of a series of
mutually exclusive events may be secured by adding the probabilities
of the individual occurrences. To determine the probability
of occurrence of all of a series of independent events the probabilities
are multiplied. Thus if a card is drawn from a deck of cards, the
probability of drawing either a spade or a club is

$$\frac{1}{4} + \frac{1}{4} = \frac{1}{2}$$

The probability of the occurrence of both of these events (a spade on
one draw and a club on a second, independent draw from the full deck) is

$$\frac{1}{4} \times \frac{1}{4} = \frac{1}{16}$$

The probability that an event will happen or fail to happen is
certainty, to which is assigned the value 1.
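The addition and multiplication rules just described can be checked with exact fractions. What follows is only a minimal sketch and not part of the original text: the helper name `probability` is illustrative, and the product is taken under the assumption that the two card draws are independent, each made from a full deck.

```python
from fractions import Fraction


def probability(favorable: int, unfavorable: int) -> Fraction:
    """Return p = a / (a + b) for a favorable and b unfavorable ways."""
    return Fraction(favorable, favorable + unfavorable)


# A coin: one way to succeed ("head"), one way to fail ("tail").
p_head = probability(1, 1)
q_head = 1 - p_head  # probability of failure

# A 52-card deck holds 13 spades and 13 clubs.
p_spade = probability(13, 39)
p_club = probability(13, 39)

# Addition rule: one card cannot be both a spade and a club,
# so the two events are mutually exclusive.
p_spade_or_club = p_spade + p_club

# Multiplication rule (assumption: two independent draws from a full deck).
p_spade_then_club = p_spade * p_club

print(p_head, q_head, p_spade_or_club, p_spade_then_club)
```

Fractions are used rather than decimals so that the results can be compared directly with the ratios written in the text.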

The probability of the appearance of either a head or a tail in
the toss of a coin is certainty:

$$\tfrac{1}{2} + \tfrac{1}{2} = 1$$

In similar manner for a toss of two coins the probability of
appearance of heads and tails may be determined from

$$\left(\tfrac{1}{2}H + \tfrac{1}{2}T\right)^2$$

or

$$\tfrac{1}{4}HH + \tfrac{1}{2}HT + \tfrac{1}{4}TT$$

or

probability of all "heads" = 1/4
probability of all "tails" = 1/4
probability of a "head" and a "tail" = 1/2
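The expansion for two coins generalizes to any number of coins. The sketch below is illustrative only (the function name `heads_distribution` is an assumption, not the author's notation); it lists the terms of the binomial expansion, each term being the probability of a given number of heads.

```python
from fractions import Fraction
from math import comb


def heads_distribution(n_coins: int, p: Fraction = Fraction(1, 2)) -> list:
    """Terms of (q + p)**n_coins; entry k is the probability of exactly k heads."""
    q = 1 - p
    return [comb(n_coins, k) * p**k * q**(n_coins - k) for k in range(n_coins + 1)]


two_coins = heads_distribution(2)    # all tails, one head and one tail, all heads
ten_coins = heads_distribution(10)   # the distribution plotted for ten coins

print([str(term) for term in two_coins])
print([str(term) for term in ten_coins])
```

For ten coins the list has eleven entries, one for each possible number of heads from 0 to 10, and the entries sum to 1.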

More generally the probability of occurrence of an event a given
number of times in N trials is expressed by¹

$$N(p + q)^n$$

Figure 24 graphically presents the distribution of the number of
heads appearing theoretically in tosses of ten coins. As n is increased
the number of points to be plotted will increase and the
curve will become smoother and smoother, ultimately approaching
in form the "normal" or Gaussian² distribution as seen in Figure 25.

This type of curve frequently results when data exhibiting
chance variation are plotted.

The mean of such a distribution may be obtained as follows:

$$\text{Mean} = Np$$

where:

N = number of trials
p = probability of success

The standard deviation:

$$\sigma = \sqrt{Npq}$$

where:

q = the probability of failure
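The two formulas can be checked against the binomial terms themselves. The following sketch is an illustration under stated assumptions: the function name `binomial_moments` is hypothetical, and `n` stands for the number of coins in a toss (ten, as in the coin illustration), which plays the part of N in the formulas.

```python
from math import comb, sqrt


def binomial_moments(n: int, p: float):
    """Mean and standard deviation computed term by term from the binomial."""
    q = 1.0 - p
    probs = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(probs))
    variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    return mean, sqrt(variance)


mean, sd = binomial_moments(10, 0.5)

# Compare with the closed forms Np and sqrt(Npq).
print(mean, 10 * 0.5)
print(sd, sqrt(10 * 0.5 * 0.5))
```

The term-by-term values and the closed forms agree to within floating-point error.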

¹ The resulting distribution is referred to as the binomial distribution. It is to be noted that
this is a discrete distribution.

² The formula for the Gaussian distribution is: $y = \dfrac{N}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}$
[Fig. 24 — Theoretical Distribution of Number of "Heads" Appearing
in Tosses of 10 Coins. Horizontal axis: number of heads, 0 to 10.]


The distribution is the normal or Gaussian distribution pre- 
viously referred to. The standard deviation shown here will 
bear the same relation to the distribution as outlined previously 
(see page 39). 


Number of Standard Deviations               Percent of Cases
(Measured plus and minus from the mean)     Included

 .6745 σ                                    50%
1.0000 σ                                    68.26%
2.0000 σ                                    95.46%
3.0000 σ                                    99.73%
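These percentages can be reproduced from the error function in the Python standard library. The sketch below is an illustration, not the author's method (the book works from printed tables), and the function name `percent_within` is hypothetical. Values computed this way may differ from the printed figures by a unit in the last decimal place.

```python
from math import erf, sqrt


def percent_within(k: float) -> float:
    """Percent of a normal distribution within k standard deviations of the mean."""
    return 100.0 * erf(k / sqrt(2.0))


for k in (0.6745, 1.0, 2.0, 3.0):
    print(f"{k:.4f} sigma: {percent_within(k):.2f}%")
```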

Generalization of Curves 



When only a limited number of items are available the fre- 
quency distribution compiled from them is generally irregular in 
form. If the number of items is increased sharply the distribution 
will show a tendency to eliminate irregularities and be smoothed. 

The sample is frequently used to generalize about the underlying
data (technically the population or universe) and therefore it
is often desirable to smooth the irregular curve obtained from the
sample. By smoothing, the curve of the sample is put into the
more general form of the underlying data, or in other words into
an "ideal" distribution. The ideal distribution represents the
distribution which would appear if an infinite number of cases
were used rather than a sample.

Where it is believed that the underlying data can be best
described by means of a normal curve,¹ data can be smoothed
by this curve by use of special tables of the ordinates and area
of the normal curve (see pages 110-111).*


Area Method of Fitting Normal Curve


In a normal distribution the percentage of the cases (area 
under the curve) included within any number of standard devia- 
tions measured from the mean can be determined from specially 
prepared tables of areas of the normal curve. 

If a distance equal to one standard deviation is measured in
both directions from the mean it will include 68.26% of the cases,
two standard deviations will include 95.46% of the cases, etc.
(see p. 105). In a similar fashion the percentage of cases included
within a distance equal to any number of standard deviations
but measured in only one direction from the mean will be
one half of the percentages shown above.

The mean of the normal curve is located at the center of the
distribution. If the distance from the mean to any given point
on the X axis is determined in terms of standard deviations,
the area included within this distance may be determined
by reference to table 31 (page 106). For example, in figure 25 the
distance from the mean $50 to point b $62 is $12. The standard
deviation of the distribution is given as $8. Thus the distance
from the mean to this point in terms of standard deviations is 1.5
standard deviations ($12 ÷ $8). By reference to figure 25 it will be
seen that 43.32% of the cases are included between the mean and
$62. If another point on the X axis is now selected, say $64, it can
be found that since this point is 1.75 standard deviations away from
the mean

$$\frac{\$64 - \$50}{\$8} = 1.75$$

45.99% of the cases will be included within these limits. Since
there are 45.99% of the cases between $64 and the mean and 43.32%
of the cases between $62 and the mean, 2.67% (45.99% - 43.32%) of
the frequencies must occur between $64 and $62. As there are 4000
cases in all, 106.8 of them are between $62 and $64 in value.
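The same arithmetic can be carried out without the printed area table. The sketch below is illustrative only; the function name `area_from_mean` is an assumption, and the error function is used in place of a table lookup, so the results agree with the text only up to the rounding of the table.

```python
from math import erf, sqrt


def area_from_mean(z: float) -> float:
    """Share of a normal curve lying between the mean and a point z sigmas away."""
    return 0.5 * erf(abs(z) / sqrt(2.0))


mean, sd, total_cases = 50.0, 8.0, 4000

z_62 = (62.0 - mean) / sd   # distance of $62 from the mean in standard deviations
z_64 = (64.0 - mean) / sd   # distance of $64 from the mean in standard deviations

share = area_from_mean(z_64) - area_from_mean(z_62)  # share of cases from $62 to $64
cases = total_cases * share

print(f"{100 * share:.2f}% of the cases, about {cases:.1f} of {total_cases}")
```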

In a similar manner the percentage of cases included within the
limits of any one class interval may now be determined, and the
number of cases or the theoretical frequency can be found by
applying that percentage to the total number of cases. If the
theoretical frequencies are then plotted, the result will be a normal
curve.

¹ See Chapter XV for methods of determining the type of underlying curve.

* Various other types of curves may be used to smooth a frequency distribution. Two
groups of curves frequently used for this purpose are the Pearson system of curves and the
Gram-Charlier series.

Since the technique of fitting the curves is beyond the scope of this book, the reader is
referred to Elderton, W. P., Frequency Curves and Correlation for the Pearson curve, and
Camp, B. W., Mathematical Part of Elementary Statistics for the Gram-Charlier group.

[Fig. 25]

The determination of the theoretical frequency (for one of the 
class intervals) in the distribution of the variations in the thick- 
ness of 600 brass washers manufactured by the A. B. C. Co. as 
shown below may be used as an example of this procedure. The 
fitting of this curve will make it possible to determine the variation 
to be expected in large numbers of these washers as manufactured 
by this company. 

The following measures were computed from the sample: 

$\bar{X}$ = .0202 inches
σ = .00085 inches
N = 600

The third class interval has a lower limit of .0188 and an upper
limit of .01919. The distance between the lower limit and the
mean (.0202 - .0188) is .0014. Since the standard deviation is
.00085 inches, in terms of standard deviations the distance is
1.66 standard deviations. Reference to the area table indicates
that 45.15% of the cases are included within this distance. Between
the lower limit of the next class interval and the mean
(.0202 - .0192 = .0010) there is a distance equal to 1.18 standard
deviations. The area table indicates that 38.10% of the cases are
included within this distance. It can then be seen that there
must be 7.05% of the cases between .0188 and .01919, or between
the upper and lower limits of group three. Since the total frequency
is given as 600 the theoretical frequency for this particular
class interval will be 42.3 (7.05% of 600). The frequency is then
plotted at the mid-point of the group. The same process is repeated
for all other class intervals as shown in table 29.


Table 29 — Fitting of Normal Curve — Area Method
Variation of Thickness in 600 Brass Washers Manufactured by the ABC Co.
(Hypothetical Data*)

Columns:
(1) Thickness (in inches)
(2) Mid-points
(3) Number of washers (f₀)
(4) Deviation of class limit from mean (x)
(5) Column (4) in terms of standard deviations (x/σ)
(6) Percent of area between class limit and mean
(7) Percent of area in class interval
(8) Theoretical frequency (f)

(1)             (2)      (3)    (4)       (5)      (6)       (7)       (8)

.0180-.01839   .0182      6    -.0022    -2.61    49.55%     1.21%      7.3
.0184-.01879   .0186     30    -.0018    -2.13    48.34      3.19      19.1
.0188-.01919   .0190     42    -.0014    -1.66    45.15      7.05      42.3
.0192-.01959   .0194     66    -.0010    -1.18    38.10     11.98      71.9
.0196-.01999   .0198     94    -.0006     -.71    26.12     16.64      99.8
.0200-.02039   .0202    120    -.0002     -.24     9.48 }
                                .0002      .24     9.48 }   18.96     113.8
.0204-.02079   .0206    102     .0006      .71    26.12     16.64      99.8
.0208-.02119   .0210     60     .0010     1.18    38.10     11.98      71.9
.0212-.02159   .0214     54     .0014     1.66    45.15      7.05      42.3
.0216-.02199   .0218     14     .0018     2.13    48.34      3.19      19.1
.0220-.02239   .0222     12     .0022     2.61    49.55      1.21       7.3

Total                   600

* Hypothetical data based on smaller distribution given by W. A. Shewhart, Economic
Control of Quality of Manufactured Product.
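The area method lends itself to a short routine. The sketch below is a hedged illustration rather than the author's procedure: the function names are hypothetical, each class is taken to run from its lower limit to the lower limit of the next class, and exact normal areas are computed in place of the two-decimal table lookups. The theoretical frequencies it yields are therefore close to, but not identical with, those of table 29.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Share of a normal curve lying below a point z standard deviations from the mean."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def fit_by_area(boundaries, mean, sd, total):
    """Theoretical frequency for each class between successive boundaries."""
    freqs = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        share = normal_cdf((hi - mean) / sd) - normal_cdf((lo - mean) / sd)
        freqs.append(total * share)
    return freqs


# Class boundaries .0180, .0184, ..., .0224 inches (eleven classes).
boundaries = [(180 + 4 * i) / 10000 for i in range(12)]
theoretical = fit_by_area(boundaries, mean=0.0202, sd=0.00085, total=600)

for lo, f in zip(boundaries, theoretical):
    print(f"{lo:.4f}  {f:6.1f}")
```

The frequencies sum to slightly less than 600 because the small areas in the two tails beyond the first and last class limits are not assigned to any class.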


Fitting the Normal Curve — Ordinate Method

The normal curve may also be fitted by reference to a table 
of ordinates of the probability curve (p. 115). The table gives 
the ordinates of the normal curve at any distance (in terms of 
standard deviations) from the mean as a percent of the maximum 
ordinate. The maximum ordinate occurs at the center of the 
distribution. 

The formula for the normal curve is

$$y = y_0\, e^{-x^2/2\sigma^2}$$

where¹

$y_0$, the maximum ordinate, $= \dfrac{N}{\sigma\sqrt{2\pi}} = \dfrac{N}{2.506628\,\sigma}$

The value of the maximum ordinate ($y_0$) for the distribution
is computed below:

σ (class interval units) = 2.109

$$y_0 = \frac{N}{2.506628\,\sigma} = \frac{600}{2.109\,(2.506628)} = 113.50$$

For the midpoint of the first group, which is 2.37 standard deviations
(.0020 inches) from the mean, we find that an ordinate erected
at this point would be 6.03% of the maximum ordinate, or 6.84.
The same procedure may be used on the other class intervals as
shown in table 30.
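The ordinate method reduces to two small formulas, sketched below for illustration. The function names are hypothetical, and the exponential is evaluated directly instead of being read from the table of ordinates.

```python
from math import exp, pi, sqrt


def maximum_ordinate(total: float, sd_in_class_units: float) -> float:
    """Height of the fitted curve at the mean: N / (sigma * sqrt(2 * pi))."""
    return total / (sd_in_class_units * sqrt(2.0 * pi))


def ordinate(z: float, y0: float) -> float:
    """Height of the curve z standard deviations from the mean."""
    return y0 * exp(-0.5 * z * z)


y0 = maximum_ordinate(600, 2.109)   # sigma expressed in class interval units
first_group = ordinate(2.37, y0)    # midpoint of the first class

print(round(y0, 2), round(100 * first_group / y0, 2), round(first_group, 2))
```

Note that the standard deviation must be expressed in class interval units here, since each theoretical frequency is the height of the curve over a class one unit wide.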


Table 30 — Fitting of Normal Curve — Ordinate Method
Variation of Thickness in 600 Brass Washers Manufactured by the ABC Co.

Columns:
(1) Thickness (in inches)
(2) Mid-points
(3) Number of washers (f₀)
(4) Deviation of mid-point from mean (x)
(5) Column (4) in terms of standard deviations (x/σ)
(6) Percent of maximum ordinate (from ordinate table)
(7) Theoretical frequency (f)

(1)             (2)      (3)    (4)      (5)      (6)        (7)

.0180-.01839   .0182      6    .0020    2.37      6.03%      6.84
.0184-.01879   .0186     30    .0016    1.90     16.45      18.67
.0188-.01919   .0190     42    .0012    1.42     36.49      41.42
.0192-.01959   .0194     66    .0008     .95     63.68      72.28
.0196-.01999   .0198     94    .0004     .47     89.54     101.62
.0200-.02039   .0202    120    .0000     .00    100.00     113.50
.0204-.02079   .0206    102    .0004     .47     89.54     101.62
.0208-.02119   .0210     60    .0008     .95     63.68      72.28
.0212-.02159   .0214     54    .0012    1.42     36.49      41.42
.0216-.02199   .0218     14    .0016    1.90     16.45      18.67
.0220-.02239   .0222     12    .0020    2.37      6.03       6.84
Testing the Goodness of Fit — The Chi Square Test

A test to determine the goodness of fit of the actual data to the
theoretical distribution has been devised by Karl Pearson.
The test involves the calculation of χ² (chi square)*

$$\chi^2 = \sum \left[\frac{(f_0 - f)^2}{f}\right]$$

where

f₀ = the observed or actual frequencies
f = the theoretical frequencies

For the problem outlined above chi square may be calculated
as in table 29a.²

¹ In the application of this formula the standard deviation in class interval units, not original
units, must be used.

² If the value of f for any class interval is very small, several groups must be combined in
order not to obtain disproportionate values of χ². See Table 29a.

* The symbol χ is the Greek small letter chi.
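The calculation of chi square can be sketched in a few lines. The example below is illustrative and makes two assumptions of its own: the helper names are hypothetical, and the two end classes on each side are combined, following the rule that very small theoretical frequencies should be grouped. The observed and theoretical frequencies are those of the area-method fit of the washer data.

```python
def merge_ends(values):
    """Combine the first two and the last two classes into single groups."""
    return [values[0] + values[1]] + list(values[2:-2]) + [values[-2] + values[-1]]


def chi_square(observed, theoretical):
    """Sum of (f0 - f)**2 / f over all groups."""
    return sum((fo - f) ** 2 / f for fo, f in zip(observed, theoretical))


observed = [6, 30, 42, 66, 94, 120, 102, 60, 54, 14, 12]
theoretical = [7.3, 19.1, 42.3, 71.9, 99.8, 113.8, 99.8, 71.9, 42.3, 19.1, 7.3]

chi_sq = chi_square(merge_ends(observed), merge_ends(theoretical))
print(round(chi_sq, 3))
```

Recomputed in this way from the listed frequencies, the total comes to roughly 9.9, somewhat below the printed total of table 29a; the printed table remains the author's figure, and the difference appears to arise in a few of its intermediate entries.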
Table 31 — Normal Curve Area Table

x/σ    .00     .01     .02     .03     .04     .05     .06     .07     .08     .09

0.0   .0000   .0040   .0080   .0120   .0159   .0199   .0239   .0279   .0319   .0359
0.1   .0398   .0438   .0478   .0517   .0557   .0596   .0636   .0675   .0714   .0753
0.2   .0793   .0832   .0871   .0910   .0948   .0987   .1026   .1064   .1103   .1141
0.3   .1179   .1217   .1255   .1293   .1331   .1368   .1406   .1443   .1480   .1517
0.4   .1554   .1591   .1628   .1664   .1700   .1736   .1772   .1808   .1844   .1879
0.5   .1915   .1950   .1985   .2019   .2054   .2088   .2123   .2157   .2190   .2224
0.6   .2257   .2291   .2324   .2357   .2389   .2422   .2454   .2486   .2518   .2549
0.7   .2580   .2612   .2642   .2673   .2704   .2734   .2764   .2794   .2823   .2852
0.8   .2881   .2910   .2939   .2967   .2995   .3023   .3051   .3078   .3106   .3133
0.9   .3159   .3186   .3212   .3238   .3264   .3289   .3315   .3340   .3365   .3389
1.0   .3413   .3438   .3461   .3485   .3508   .3531   .3554   .3577   .3599   .3621
1.1   .3643   .3665   .3686   .3708   .3729   .3749   .3770   .3790   .3810   .3830
1.2   .3849   .3869   .3888   .3907   .3925   .3944   .3962   .3980   .3997   .4015
1.3   .4032   .4049   .4066   .4083   .4099   .4115   .4131   .4147   .4162   .4177
1.4   .4192   .4207   .4222   .4236   .4251   .4265   .4279   .4292   .4306   .4319
1.5   .4332   .4345   .4357   .4370   .4382   .4394   .4406   .4418   .4430   .4441
1.6   .4452   .4463   .4474   .4485   .4495   .4505   .4515   .4525   .4535   .4545
1.7   .4554   .4564   .4573   .4582   .4591   .4599   .4608   .4616   .4625   .4633
1.8   .4641   .4649   .4656   .4664   .4671   .4678   .4686   .4693   .4699   .4706
1.9   .4713   .4719   .4726   .4732   .4738   .4744   .4750   .4756   .4762   .4767
2.0   .4773   .4778   .4783   .4788   .4793   .4798   .4803   .4808   .4812   .4817
2.1   .4821   .4826   .4830   .4834   .4838   .4842   .4846   .4850   .4854   .4857
2.2   .4861   .4865   .4868   .4871   .4875   .4878   .4881   .4884   .4887   .4890
2.3   .4893   .4896   .4898   .4901   .4904   .4906   .4909   .4911   .4913   .4916
2.4   .4918   .4920   .4922   .4925   .4927   .4929   .4931   .4932   .4934   .4936
2.5   .4938   .4940   .4941   .4943   .4945   .4946   .4948   .4949   .4951   .4952
2.6   .4953   .4955   .4956   .4957   .4959   .4960   .4961   .4962   .4963   .4964
2.7   .4965   .4966   .4967   .4968   .4969   .4970   .4971   .4972   .4973   .4974
2.8   .4974   .4975   .4976   .4977   .4977   .4978   .4979   .4980   .4980   .4981
2.9   .4981   .4982   .4983   .4984   .4984   .4984   .4985   .4985   .4986   .4986
3.0   .49865  .4987   .4987   .4988   .4988   .4988   .4989   .4989   .4990   .4990
3.1   .49903  .4991   .4991   .4991   .4992   .4992   .4992   .4992   .4993   .4993

Table 29a — The Chi Square Test for Goodness of Fit — Data of Table 29

(1)               (2)          (3)            (4)         (5)          (6)
Thickness         Number of    Theoretical    (f₀ - f)    (f₀ - f)²    (f₀ - f)²/f
(In Inches)       Washers      Frequency
                  (f₀)         (f)

.0180-.01839        6 }          7.3 }
.0184-.01879       30 }         19.1 }          9.6         92.16        3.491
.0188-.01919       42           42.3            -.3           .09         .007
.0192-.01959       66           71.9           -5.9         34.81         .484
.0196-.01999       94           99.8           -5.8         33.64         .337
.0200-.02039      120          113.8            6.2         38.44         .338
.0204-.02079      102           99.8            2.4          5.76         .058
.0208-.02119       60           71.9          -11.9        141.61        1.970
.0212-.02159       54           42.3           12.3        151.29        3.628
.0216-.02199       14 }         19.1 }
.0220-.02239       12 }          7.3 }          -.4           .16         .001

                                                              χ² =      10.314
Table 32 — Normal Curve Ordinates
(As decimal of maximum ordinate)

x/σ    .00      .01      .02      .03      .04      .05      .06      .07      .08      .09

0.0  1.00000   .99995   .99980   .99955   .99920   .99875   .99820   .99755   .99680   .99596
0.1   .99501   .99396   .99283   .99158   .99025   .98881   .98728   .98565   .98393   .98211
0.2   .98020   .97819   .97609   .97390   .97161   .96923   .96676   .96420   .96156   .95882
0.3   .95600   .95309   .95010   .94702   .94387   .94055   .93723   .93382   .93034   .92677
0.4   .92312   .91939   .91558   .91169   .90774   .90371   .89961   .89543   .89119   .88688
0.5   .88250   .87805   .87353   .86896   .86432   .85962   .85488   .85006   .84519   .84026
0.6   .83527   .83023   .82514   .82000   .81481   .80957   .80429   .79896   .79358   .78817
0.7   .78270   .77721   .77167   .76610   .76048   .75484   .74916   .74342   .73769   .73193
0.8   .72615   .72033   .71448   .70861   .70272   .69681   .69087   .68493   .67896   .67296
0.9   .66698   .66097   .65494   .64891   .64287   .63683   .63077   .62472   .61865   .61259
1.0   .60653   .60047   .59440   .58834   .58228   .57623   .57017   .56414   .55810   .55209
1.1   .54607   .54007   .53409   .52812   .52214   .51620   .51027   .50437   .49848   .49260
1.2   .48675   .48092   .47511   .46933   .46357   .45783   .45212   .44644   .44078   .43516
1.3   .42956   .42399   .41845   .41294   .40747   .40202   .39661   .39123   .38589   .38058
1.4   .37531   .37007   .36487   .35971   .35459   .34950   .34445   .33944   .33447   .32954
1.5   .32465   .31980   .31500   .31023   .30550   .30082   .29618   .29158   .28702   .28251
1.6   .27804   .27361   .26923   .26489   .26059   .25634   .25213   .24797   .24385   .23978
1.7   .23575   .23176   .22782   .22392   .22008   .21627   .21251   .20879   .20511   .20148
1.8   .19790   .19436   .19086   .18741   .18400   .18064   .17732   .17404   .17081   .16762
1.9   .16448   .16137   .15831   .15530   .15232   .14939   .14650   .14364   .14083   .13806
2.0   .13534   .13265   .13000   .12740   .12483   .12230   .11981   .11737   .11496   .11259
2.1   .11025   .10795   .10570   .10347   .10129   .09914   .09702   .09495   .09290   .09090
2.2   .08892   .08698   .08507   .08320   .08137   .07956   .07778   .07604   .07433   .07265
2.3   .07100   .06939   .06780   .06624   .06471   .06321   .06174   .06029   .05888   .05750
2.4   .05614   .05481   .05350   .05222   .05096   .04973   .04852   .04734   .04618   .04505
2.5   .04394   .04285   .04179   .04074   .03972   .03873   .03775   .03680   .03586   .03494
2.6   .03405   .03317   .03232   .03148   .03066   .02986   .02908   .02831   .02757   .02684
2.7   .02612   .02542   .02474   .02408   .02343   .02280   .02218   .02157   .02098   .02040
2.8   .01984   .01929   .01876   .01823   .01772   .01723   .01674   .01627   .01581   .01536
2.9   .01492   .01449   .01408   .01367   .01328   .01288   .01252   .01215   .01179   .01145
3.0   .01111   .00819   .00598   .00432   .00309   .00219   .00153   .00106   .00073   .00050
4.0   .00034   .00022   .00015   .00010   .00006   .00004   .00003   .00002   .00001   .00001

In the rows for 3.0 and 4.0 the successive columns advance by tenths
(3.0, 3.1, ... 3.9 and 4.0, 4.1, ... 4.9) rather than by hundredths.


By reference to a set of tables¹ chi square may be evaluated.

In tables of χ², N equals the number of class intervals. The
value indicated in the table (p) is the probability of obtaining a
fit, due to chance, as poor as or worse than the one obtained. If this
probability is small the likelihood that the disparities between the
observed and theoretical data are due to chance is small.²

The value of χ² in the problem above for N = 9 class intervals
indicates a value for p of more than .30. In other words there
are more than 30 chances out of 100 that the fit obtained would
be as bad as or worse than the one shown. In 30 chances out of 100
the variations or departures from normality occurring in the sample
might be as bad as or worse than those actually occurring,
due to chance fluctuations with no real departure from normality.

¹ An extended set of these tables is to be found in Pearson's Tables for Statisticians and
Biometricians.

² Generally, if the indicated value of p is less than some specified value, usually .05 or .01,
the discrepancies are accepted as too large to be accidental. The accepted limit value of p
(usually .05 or .01) is referred to as the fiducial point.
The chi square test may be used to test a large variety of hypotheses
in many fields by comparing the expected results (frequencies)
based upon the hypothesis to be tested and the actual
results obtained by securing observations. If the chi square test
demonstrates that the disparity between the actual and the
expected frequencies is too large to be ascribable to chance (if p is
less than the selected fiducial limit of .01 or .05), the hypothesis
may be said to be false.


Section of χ² Table

 N     n (N - 1)    p = .99     p = .05     p = .01

 2         1         .00016       3.84        6.64
 3         2         .020         5.99        9.21
 4         3         .115         7.82       11.34
 5         4         .297         9.49       13.28
 6         5         .554        11.07       15.09
 7         6         .872        12.59       16.81
 8         7        1.239        14.07       18.48
 9         8        1.646        15.51       20.09
10         9        2.088        16.92       21.67
11        10        2.558        18.31       23.21
12        11        3.053        19.68       24.73
13        12        3.571        21.03       26.22
14        13        4.107        22.36       27.69
15        14        4.660        23.69       29.14
16        15        5.229        25.00       30.58
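The use of the section of the χ² table can be sketched as a simple lookup. This is an illustration only: the function name `exceeds_fiducial_limit` and the dictionary layout are assumptions, the values are copied from the p = .05 and p = .01 columns above, and the table is entered with n = N - 1.

```python
# Values of chi square from the section of the table above, keyed by n = N - 1.
# Each entry holds (value at p = .05, value at p = .01).
CRITICAL = {
    1: (3.84, 6.64), 2: (5.99, 9.21), 3: (7.82, 11.34), 4: (9.49, 13.28),
    5: (11.07, 15.09), 6: (12.59, 16.81), 7: (14.07, 18.48), 8: (15.51, 20.09),
    9: (16.92, 21.67), 10: (18.31, 23.21), 11: (19.68, 24.73), 12: (21.03, 26.22),
    13: (22.36, 27.69), 14: (23.69, 29.14), 15: (25.00, 30.58),
}


def exceeds_fiducial_limit(chi_sq: float, n: int, level: float = 0.05) -> bool:
    """True when chi square is larger than the tabled value at the chosen level."""
    if level not in (0.05, 0.01):
        raise ValueError("only the .05 and .01 columns are tabulated here")
    at_05, at_01 = CRITICAL[n]
    limit = at_05 if level == 0.05 else at_01
    return chi_sq > limit


# Washer problem: nine groups, so n = 8; printed chi square of 10.314.
print(exceeds_fiducial_limit(10.314, 8, level=0.05))
```

A result of False means that the disparity is not too large to be ascribed to chance at the chosen limit, so the hypothesis of normality is not rejected.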
CHAPTER XIII
THEORY OF SAMPLING 

The Sample 

Statistical technique as applied to a given mass of data enables
us to analyze those data. However, where the mass of data is too
great to be handled in its entirety, samples of the data
subjected to the same technique enable us to generalize about the
mass from which the sample is drawn.

If the results are confined to the cases studied they may be
used for descriptive purposes. A number of problems, however,
must be solved before the results can be generalized and applied
to the larger number of cases not included in the original study.

Much time, energy and money would be needed to make a
comprehensive analysis of a great mass of statistical data. Consequently
there is every incentive to resort to study of part or
parts of the data. This process is known as sampling. The
validity of the results obtained depends upon the fairness of the
sample and the technique employed in studying that sample.

A few cases in which the sampling method is indicated are out- 
lined below. 

In the physical sciences it frequently becomes impossible to
obtain further data and sampling must be adopted, as when
an experiment cannot be performed beyond a given number of
times. Also, the results of a scientific experiment repeated ten
times can be used as a generalization of the results which might
logically be assumed to be obtainable if the experiment were performed
an infinite number of times.

To obtain an average Intelligence Quotient for third grade 
public school students the use of any other method than sampling 
would mean the very expensive accumulation of an enormous 
mass of data. 

Again, let us say it is desired to obtain the average price of
bread in New York City. Obviously the cost factor and the time
required for a complete survey of the city's thousands of bakeshops
and grocery stores would be prohibitive.

In the last instance cited, if prices were obtained from chain
stores only the sample would be prejudiced. To secure representative
data it would be necessary to obtain sample prices from
all the varied types of stores; from, in technical terms, the entire
population.¹ The requirements for a representative sample can
be summed up as follows:

1. The sample must be selected without bias or prejudice.

2. The components of the sample must be completely independent
of one another.

3. There should be no underlying differences between areas
from which the data are selected.

4. Conditions must be the same for all items constituting the
sample.

¹ The population may be defined as the entire data from which the sample was drawn if
all of it were available.

The limited sample is most generally used to describe the 
larger population or group of data from which the sample was 
taken. 

When the measures computed from a sample are used to charac- 
terize the population, it is necessary to estimate the reliability 
of the measures, in other words the degree of error to which the 
generalization may be subject. 

Samples may be drawn from the underlying population in several
different ways. The conditions outlined above are descriptive
of the random sample. The values composing a sample of this
type are drawn entirely at random from the population.

However, it is often desirable to segregate a heterogeneous 
population into homogeneous sub groups and to draw from each 
sub group at random. This process results in the stratified sample. 
A survey of the number of rooms in homes in a given city by resort 
to the sampling method may be more effectively secured by 
drawing a random sample from uniform areas of the city. Thus a 
random sample of a low income area in the same proportion that 
the population of that area bears to the total population might 
be combined with random samples drawn in the proper proportion 
from other income areas. 
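Proportional stratified sampling of this kind can be sketched as follows. Everything in the example is hypothetical: the function name, the three income areas and their sizes, and the fixed seed used so that the run can be repeated.

```python
import random


def stratified_sample(strata: dict, sample_size: int, seed: int = 0) -> dict:
    """Draw at random from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        # Rounding may leave the grand total one or two items off sample_size.
        k = round(sample_size * len(items) / total)
        sample[name] = rng.sample(items, k)
    return sample


# Hypothetical city of 1000 homes, identified by number, in three income areas.
strata = {
    "low income": list(range(0, 600)),
    "middle income": list(range(600, 900)),
    "high income": list(range(900, 1000)),
}
sample = stratified_sample(strata, sample_size=50)
print({name: len(homes) for name, homes in sample.items()})
```

Each area contributes to the sample in the same proportion that it bears to the whole, while the choice of homes within an area is left to chance.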

The purposive sample represents a deliberate selection of a
sample manipulated by the statistician in such a fashion as to
obtain a representative cross section of the population.

Measures of Reliability and Significance — Standard Errors

In the bread price problem outlined above, let us assume that
one thousand investigators are sent out each to obtain a sample of
prices for a given size and type of bread from 1000 stores. If
the average prices for the samples were then arranged in the form
of a frequency distribution they would be found to tend toward
a "normal" distribution.

The average of the means of the 1000 samples of 1000 prices
each would undoubtedly result in a figure either the same as or
very near the true mean of the underlying data. When the curve
is idealized as though an infinite number of cases had been considered,
the "true" mean for bread prices in New York could be
obtained by computing the average of the distribution.
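The behavior of sample means can be illustrated by a small simulation. The sketch below rests entirely on invented figures: a hypothetical population of 20,000 bread prices spread evenly between 8 and 12 cents, 1000 samples of 100 prices each, and a fixed seed so that the run can be repeated.

```python
import random
from statistics import mean, pstdev

rng = random.Random(12345)

# Hypothetical population of bread prices in dollars.
population = [round(rng.uniform(0.08, 0.12), 4) for _ in range(20000)]

sample_size = 100
sample_means = [mean(rng.sample(population, sample_size)) for _ in range(1000)]

mean_of_means = mean(sample_means)       # should lie very near the population mean
sd_of_means = pstdev(sample_means)       # spread of the distribution of sample means
predicted = pstdev(population) / sample_size ** 0.5  # sigma / sqrt(N)

print(round(mean(population), 5), round(mean_of_means, 5))
print(round(sd_of_means, 5), round(predicted, 5))
```

The average of the sample means falls close to the mean of the population, and the spread of the sample means is close to the population standard deviation divided by the square root of the sample size.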

From the hypothetical normal distribution shown above, it
can be seen that the means of some of the samples were quite a
distance from the "true" mean of the population. If the distance
of the mean furthest away can be obtained the greatest possible
error will be known. It has already been shown that 99.7% of
the cases will be included within a distance of 3 standard deviations
from the mean (see p. 38). If the value of the standard deviation
of this distribution (of the means of the samples) can be
obtained, in 99.7 chances out of 100 no error (difference between
sample mean and "true" mean) will occur larger than 3 times
the value of the standard deviation.



The standard deviation of a distribution of means or any other
statistical measure computed from samples is termed the standard
error of the mean ($\sigma_{\bar{X}}$) or the standard error of any other statistical
measure.

The error which will not be exceeded by 50% of the cases is
known as the probable error.¹ It is equal to .6745 times the
standard error.

¹ Although widely used the probable error is of comparatively little value. It can be interpreted
as meaning that if another sample were drawn of the same number of items the chances
are even that a discrepancy between the sample and true mean larger than the probable error
would not exist.

Extensive use of the probable error, notably by American
statisticians, gives it a fictitious value far beyond its real worth
as compared with the standard error.¹

It is obvious that the greater the number of cases included 
in the sample the smaller the error to be expected. Therefore 
the standard deviation (error) of the theoretical distribution of 
means (or other measure) computed from samples will be smaller. 
In turn the standard error will vary inversely with the number of 
cases included in the sample. 

When the limits of variation in the population or original data
are great, for example a range of $1 to $1,000,000, a greater
"error" is to be expected in a measure computed from the sample
than when the range of values in the population is small, e.g., from
$1 to $10. The standard deviation is the measure of the spread
of data of the population. The standard deviation of the sample
is commonly used as an estimate of the standard deviation of the
population. It follows that the standard error of a measure
computed from a sample will vary directly with the standard
deviation of the population from which the sample was drawn.

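The formula for the standard error of the mean (the standard
deviation of the distribution of the means of samples) is

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}$$

where

σ = standard deviation of sample.
N = number of cases in the sample.

The probable error of the mean is therefore

$$P.E._{\bar{X}} = .6745\,\frac{\sigma}{\sqrt{N}}$$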

Reliability can be evaluated if the average price of bread is computed
from one sample of prices obtained from 1000 stores. If
the following results were obtained from the sample:

$\bar{X}$ = $.10
σ = $.01

the standard error of the mean would be

$$\sigma_{\bar{X}} = \frac{\$.01}{\sqrt{1000}} = \$.00032$$
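The computation for the bread sample can be sketched briefly. The function name is hypothetical; the figures are those of the example (a standard deviation of $.01 and 1000 stores).

```python
from math import sqrt


def standard_error_of_mean(sd: float, n: int) -> float:
    """Standard error of the mean of a random sample: sd / sqrt(n)."""
    return sd / sqrt(n)


se = standard_error_of_mean(0.01, 1000)
probable_error = 0.6745 * se   # error not exceeded in half of the cases
three_se = 3 * se              # limit not exceeded in 99.7 chances out of 100

print(f"standard error {se:.5f}")
print(f"probable error {probable_error:.5f}")
print(f"three standard errors {three_se:.5f}")
```

Three times the unrounded standard error comes to a shade under $.00096, since that figure results from tripling the rounded value $.00032.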


As shown above there are 99.7 chances out of 100 that the mean
computed from a random sample of 1000 cases will not be further
away than three standard errors of the mean from the true average
price. It can now be assumed that the computed average of
$.10 will not be more than $.00096 away from the "true" average
(99.7 chances out of 100).

¹ R. A. Fisher points out that "the common use of the probable error is its only recommendation
. . . when any critical test is required, the deviation must be expressed in terms of the
standard error."* It has also been pointed out that the standard error has been commonly
used by European statisticians in preference to the probable error.

* Fisher, R. A., Statistical Methods for Research Workers, 1928, p. 46.

The probability of occurrence and the odds against the occurrence
of an error as great as the given number of standard errors
as given by Pearl¹ are shown in table 32a.

The formula for the standard error of the mean, shown above,
is for the mean of a random sample. The formula for the standard
error of the mean of a stratified sample is²

$$\sigma_{\bar{X}} = \sqrt{\frac{\sigma^2 - \sigma_m^2}{N}}$$

where

$\sigma_m$ = standard deviation of the means of
each of the strata comprising the
sample.

In computing the standard deviation of the means of the strata,
the deviations of the mean of each stratum from the mean of all the
values must be weighted in proportion to the number of items in
each stratum.

It will be seen from this formula that the standard error of the
mean of a stratified sample will always be equal to or less than
that of the mean of a random sample, and the mean of a stratified
sample is therefore equally or more reliable than the mean of a
random sample.

The formulas discussed above apply to the means of samples 
drawn from infinitely large populations or means of samples which 
are very small in comparison to the size of the underlying popula- 
tion. When the population is finite in size, particularly when the 
size of the sample is appreciable in proportion to the size of the 
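universe, the standard error of the mean may be computed from
the formula³

\[ \sigma_{\bar{x}}' = \sigma_{\bar{x}} \sqrt{1 - \frac{n}{N}} \]

or

\[ \sigma_{\bar{x}}' = \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n}{N}} \]

where

n = number of items in sample
N = number of items in population
$\sigma_{\bar{x}}$ = the standard error of the mean of a random
sample as shown above.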
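
The effect of a finite population can be sketched in a few lines of Python. This is only an illustration: it assumes the correction takes the usual form of multiplying the random-sample standard error by the square root of (1 - n/N), and the function names and the figures (a standard deviation of 10, a sample of 100 and a population of 400) are hypothetical, not taken from the text.

```python
import math

def se_mean(sd, n):
    """Standard error of the mean of a random sample from a very large population."""
    return sd / math.sqrt(n)

def se_mean_finite(sd, n, population_size):
    """Standard error of the mean when the sample is an appreciable
    part of a finite population (assumed correction: sqrt(1 - n/N))."""
    return se_mean(sd, n) * math.sqrt(1 - n / population_size)

# Hypothetical figures chosen only for illustration.
sd, n, population_size = 10.0, 100, 400
print(f"very large population: {se_mean(sd, n):.3f}")
print(f"finite population:     {se_mean_finite(sd, n, population_size):.3f}")
```

When the sample is a small fraction of the population the two results are practically identical, which is why the correction is usually ignored in that case.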


¹ Pearl, Raymond, Medical Biometry and Statistics, W. B. Saunders, Philadelphia, 1930.

² Yule, G. U., and Kendall, M. G., An Introduction to the Theory of Statistics, London, 1937,
p. 380.

³ Bowley, Arthur L., Elements of Statistics, Sixth Edition, 1937, pp. 332-333. This formula
is given by Richardson as


Richardson, C. H., An Introduction to Statistical Analysis, Harcourt, Brace & Co., New York,
1936, p. 26D.
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In a similar fashion the standard errors of other statistical 
measures may be computed. The formulae for some of these 
standard errors are shown below (page 119). 

This list represents merely a small number of the standard error 
formulae. For a more complete listing see the list of formulas 
at the rear of this book and Dunlap, J. W. and Kurtz, A. K, 
Handbook of Statistical Nomographs, Tables and Formulae. 


Table 32a — The Probability of Occurrence of Statistical Deviations of Different 
Magnitudes Relative to the Standard Error 


Number of          Probability of occurrence of a      Odds against the occurrence of a
Standard Errors    deviation as great as or greater    deviation as great as or greater
                   than designated number of           than the designated number of
                   Standard Errors                     Standard Errors

0.67449            50.00%                              1.00 to 1
0.7                48.39                               1.07 to 1
0.8                42.37                               1.36 to 1
0.9                36.81                               1.72 to 1
1.0                31.73                               2.15 to 1
1.1                27.13                               2.69 to 1
1.2                23.01                               3.35 to 1
1.3                19.36                               4.17 to 1
1.4                16.15                               5.19 to 1
1.5                13.36                               6.48 to 1
1.6                10.96                               8.12 to 1
1.7                8.91                                10.22 to 1
1.8                7.19                                12.92 to 1
1.9                5.74                                16.41 to 1
2.0                4.55                                20.98 to 1
2.1                3.57                                26.99 to 1
2.2                2.78                                34.96 to 1
2.3                2.14                                45.62 to 1
2.4                1.64                                60.00 to 1
2.5                1.24                                79.52 to 1
2.6                .932                                106.3 to 1
2.7                .693                                143.2 to 1
2.8                .511                                194.7 to 1
2.9                .373                                267.0 to 1
3.0                .270                                369.4 to 1
3.1                .194                                515.7 to 1
3.2                .137                                726.7 to 1
3.3                .0967                               1,033 to 1
3.4                .0674                               1,483 to 1
3.5                .0465                               2,149 to 1
3.6                .0318                               3,142 to 1
3.7                .0216                               4,637 to 1
3.8                .0145                               6,915 to 1
3.9                .00962                              10,390 to 1
4.0                .00634                              15,770 to 1
5.0                .0000573                            1,744,000 to 1
6.0                .00000020                           500,000,000 to 1
7.0                .00000000026                        400,000,000,000 to 1
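
The entries of such a table can be reproduced from the normal curve. The sketch below is illustrative only; the function names are hypothetical, and it assumes the tabled probability is the two-sided area of the normal curve beyond the given number of standard errors, which agrees with the printed figures.

```python
import math

def two_sided_tail(k):
    """Probability of a deviation at least k standard errors from the
    mean, in either direction, under the normal curve."""
    return math.erfc(k / math.sqrt(2))

def odds_against(k):
    """Odds against a deviation as great as or greater than k standard errors."""
    p = two_sided_tail(k)
    return (1 - p) / p

for k in (0.67449, 1.0, 2.0, 3.0):
    print(f"{k:>8}: {100 * two_sided_tail(k):8.3f}%   {odds_against(k):10.2f} to 1")
```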



Measure Standard Error Formula | Probable Error Formula 
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Significance of the Difference Between Two Means 

Very frequently it is desirable to test the means of two samples
to determine whether there is any significant difference between
them, or whether the difference, if any, is merely due to chance.

In scientific fields it is customary to use a "control" when carrying
out an experiment. This control, to which the new technique
is not applied, is used as a basis for comparison. It is then essential
to determine whether the measured results for the "control"
group differ significantly from those for the experimental group.

As an instance, if a new technique for teaching spelling is subjected
to experimental test, two groups of pupils are used. The
new technique is applied to group 1, and group 2 is used solely
as a control. The results (grades on examinations, etc.) are then
tested to determine the significance of the difference between the
mean grades for the two groups.

If two samples of any data are drawn from a given population
undoubtedly there will be a difference between the means of the
samples, a difference due solely to chance variations in the selection
of items.



Fig. 27 — Theoretical Distribution of Difference Between Means of a
Large Number of Pairs of Samples Drawn at Random from a Given
"Population."
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If a very large number of these pairs of samples are drawn from 
the population, and if the difference between the means of the 
pairs is arranged in the form of a frequency distribution, the re- 
sulting distribution will be normal. The true difference is in 
reality zero, since all these pairs of samples are drawn from the
same population and only chance or accidental differences arise 
between the samples. If all the differences determined from a 
very large or more exactly an infinite number of pairs are averaged 
(signs retained), the true difference (zero) will result. 

The situation may then be shown by a normal curve repre- 
senting a distribution of the differences between the means of an 
infinite number of pairs of samples. The mean of this distribution 
would then be zero, the true difference, and positive and negative 
differences due to chance would arise about this mean. 

It is clear from the information previously given about the
normal curve that, 99.7 chances out of 100, no difference larger
than 3 standard deviations of this distribution of differences
(three standard errors of the difference between the two means)
would arise. If, therefore, the actual difference is larger than
3 standard errors of the difference between the means (or, in other
words, if the probability of such a difference due to chance is
very small) it can be said that the difference is significant and not
due to chance.

The standard error of the difference between two means (stand- 
ard deviation of the theoretical distribution of differences between 
means of samples) can be obtained from the following:¹

\[ \sigma_D = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}} \]

where

$\sigma_1$ = standard deviation of first sample
$\sigma_2$ = standard deviation of the second sample
$N_1$ = number of items in first sample
$N_2$ = number of items in second sample.

As a result of a time study a new method was outlined for a cer- 
tain operation in a factory. The average time on fifty trials 
for the operation using the old method was 17.5 seconds with a 
standard deviation of 1.5 seconds. After learning the new method 
the workmen were again timed with a resulting average time for
fifty trials of 15 seconds and a standard deviation of 1.2 seconds. 
It is now possible by means of the technique outlined above to 


¹ This is based on the assumption that the two samples from which the means were
computed are uncorrelated.
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determine whether the difference in average time of 2.5 seconds 
is significant or merely due to chance. 





\[ \sigma_D = \sqrt{\frac{(1.5)^2}{50} + \frac{(1.2)^2}{50}} = .27 \text{ seconds} \]


Chance alone would (99.7 chances in 100) produce a difference no
larger than .81 seconds (3 standard errors of the difference
between the two means). Since the actual difference (2.5 seconds) is
much larger than this amount (.81 seconds) it is extremely
unlikely that it arose due to chance.
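
The arithmetic of the time-study example can be checked with a short Python sketch. It assumes, as the text does, that the two samples are uncorrelated; the function and variable names are illustrative only.

```python
import math

def se_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two uncorrelated sample means."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Figures from the time-study example: old method, then new method.
old_mean, old_sd, old_n = 17.5, 1.5, 50
new_mean, new_sd, new_n = 15.0, 1.2, 50

se = se_difference(old_sd, old_n, new_sd, new_n)
difference = old_mean - new_mean

print(f"standard error of the difference: {se:.2f} seconds")
print(f"three standard errors:            {3 * se:.2f} seconds")
print(f"difference exceeds three standard errors: {difference > 3 * se}")
```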

The significance of the difference between any two statistical 
measures computed from two samples may be obtained from : 


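A. If the two samples are correlated

\[ \sigma_D = \sqrt{\sigma_{\theta_1}^2 + \sigma_{\theta_2}^2 - 2 r_{12} \sigma_{\theta_1} \sigma_{\theta_2}} \]

where

$\sigma_{\theta_1}$ is the standard error of any statistic $\theta$ computed
from sample #1

$\sigma_{\theta_2}$ is the standard error of any statistic computed from
sample #2

$r_{12}$ is the coefficient of correlation between the two samples


B. If the samples are uncorrelated

\[ \sigma_D = \sqrt{\sigma_{\theta_1}^2 + \sigma_{\theta_2}^2} \]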


Significance of Difference Between Proportions 

If two random samples are drawn and each indicates that a given
characteristic occurs in a certain proportion, the difference between
the two proportions can be tested to determine whether it is
significant or arises out of a sampling fluctuation by use of the
formula

\[ \sigma_{D\%} = \sqrt{p q \left( \frac{1}{N_1} + \frac{1}{N_2} \right)} \]

where

p is the total percentage of occurrence (both samples combined)

q = 1 - p

$N_1$ = number in first sample
$N_2$ = number in second sample

In a study of the effectiveness of slogans it was found that 75.7%
of the males questioned recognized a certain slogan while 66.3% of
the females questioned recognized the slogan. The above formula
may be applied to determine whether there was a significant
difference in the percentage of recognition by the two sexes.
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Table 32b — Remits of Recognition Test of Paris Garter Slogan^ 

"No Metal Can Touch you/' as Given to 374 College Students 


Sex        Number         Percent        Number
           Recognizing    Recognizing    Questioned

Male          209            75.7%          276
Female         65            66.3%           98
Total         274            73.3%          374


From Gliok, S., Commercial Slogans, Thesis, College of City of New York, 193ri.


p = 73.3%
q = 26.7%

\[ \sigma_{D\%} = \sqrt{(.733)(.267) \left( \frac{1}{276} + \frac{1}{98} \right)} = .052 = 5.2\% \]

Since the actual difference between the two proportions (75.7%
- 66.3% = 9.4%) is 1.81 times the standard error of the
difference, there are a little over 7 chances in 100 that the difference
is a chance difference due to sampling.
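
The slogan example can be reproduced as follows. This is a sketch only: the names are illustrative, the pooled proportion is computed from the counts (274 of 374) rather than from the rounded 73.3%, and the chance probability is taken from the two-sided area of the normal curve.

```python
import math

def se_diff_proportions(p, n1, n2):
    """Standard error of the difference between two sample proportions,
    using the pooled proportion p of both samples combined."""
    q = 1 - p
    return math.sqrt(p * q * (1 / n1 + 1 / n2))

# Counts from the recognition test: males, then females.
recognized_m, asked_m = 209, 276
recognized_f, asked_f = 65, 98

p_pooled = (recognized_m + recognized_f) / (asked_m + asked_f)
se = se_diff_proportions(p_pooled, asked_m, asked_f)
difference = recognized_m / asked_m - recognized_f / asked_f
ratio = difference / se

# Two-sided normal probability of a chance difference at least this large.
chance = math.erfc(ratio / math.sqrt(2))

print(f"standard error: {100 * se:.1f}%")
print(f"difference is {ratio:.2f} standard errors")
print(f"chances in 100 of a chance difference this large: {100 * chance:.1f}")
```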

Standard Error of Measurement

A certain degree of variation must be expected when physical 
measurements are performed. If a distance is measured re- 
peatedly or if a quantity is weighed several times, the results will 
show a degree of variation. 

If the average of the several measurements is taken as the true 
measurement it must be remembered that this average is a 
measurement obtained from a sample. It is therefore subject to 
a sampling error which may be computed. If a measurement is 
made 10 times it constitutes a sample of 10 measurements drawn 
from a universe of an infinite number of measurements which may 
be made. 

It has been shown previously that the error of such a mean can 
be computed through the use of the standard error or probable 
error of the mean. 

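For Large Samples (N > 30)

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}} \]

\[ P.E._{\bar{x}} = .6745 \frac{\sigma}{\sqrt{N}} \]

For Small Samples (N < 30) (See page 129)

\[ S_{\bar{x}} = \frac{s}{\sqrt{N}} \]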
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However it is frequently necessary to combine measurements 
to obtain areas, volumes, etc. In this case the standard error of 
the resulting value must be obtained.

1. When the individual measurements are combined by addition
the standard error of the resulting value may be obtained from¹

\[ \sigma_{x_1 + x_2 + \dots + x_n}^2 = \sigma_{x_1}^2 + \sigma_{x_2}^2 + \sigma_{x_3}^2 + \dots + \sigma_{x_n}^2 \]

To obtain the distance between two points the distance was
measured in two separate sections, with the following results, as
an average of a number of measurements:

Distance #1 = 500 yds.

Distance #2 = 600 yds.

The standard error of the measurements of the first distance
was 2 yards while that of the second was 2.5 yards.
The standard error of the entire distance, 500 yards + 600 yards, is

\[ \sigma_{x_1 + x_2}^2 = 4 + 6.25 = 10.25 \]
\[ \sigma_{x_1 + x_2} = 3.20 \text{ yards} \]

¹ These relationships are true only if the items are mutually independent, for when they are
not the relationship is

\[ \sigma_{x_1 + x_2} = \sqrt{\sigma_{x_1}^2 + \sigma_{x_2}^2 + 2 r_{12} \sigma_{x_1} \sigma_{x_2}} \]

2. If the measurement is raised to any power (n) the standard
error of the resulting value may be obtained from

\[ \frac{\sigma_{x^n}}{x^n} = n \left( \frac{\sigma_x}{x} \right) \]

The measurements of one side of a square result in an average
length of 10 feet with a standard error of .05 feet. The standard
error of the area can be obtained as follows:

Area = L² = 10² = 100 square feet

\[ \frac{\sigma_{x^2}}{100} = 2 \left( \frac{.05}{10} \right) = .01, \qquad \sigma_{x^2} = 1 \text{ square foot} \]


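3. The standard error of the product of a series of means, the
standard errors of which are known, is obtained from:

\[ \left( \frac{\sigma_P}{P} \right)^2 = \left( \frac{\sigma_{x_1}}{x_1} \right)^2 + \left( \frac{\sigma_{x_2}}{x_2} \right)^2 + \dots + \left( \frac{\sigma_{x_n}}{x_n} \right)^2 \]

where $P = x_1 x_2 \dots x_n$.

4. The standard error of a quotient can be obtained from

\[ \left( \frac{\sigma_Q}{Q} \right)^2 = \left( \frac{\sigma_{x_1}}{x_1} \right)^2 + \left( \frac{\sigma_{x_2}}{x_2} \right)^2 \]

where $Q = x_1 / x_2$.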
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The standard error of the volume of a cylindrical tank as obtained 
from a series of measurements of its radius and height is obtained 
as follows : 


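radius = 10 feet (r)          $\sigma_r$ = .1 feet
height = 20 feet (h)          $\sigma_h$ = .2 feet

V = 3.14159(10)²(20) = 6283.18 cubic feet

The standard error of r² can be obtained from

\[ \frac{\sigma_{r^2}}{r^2} = 2 \left( \frac{\sigma_r}{r} \right) \]

and the standard error of the volume from

\[ \left( \frac{\sigma_V}{V} \right)^2 = \left( \frac{\sigma_{r^2}}{r^2} \right)^2 + \left( \frac{\sigma_h}{h} \right)^2 \]

or in this case where V = πr²h

\[ \left( \frac{\sigma_V}{6283.18} \right)^2 = \left( \frac{2 (.1)}{10} \right)^2 + \left( \frac{.2}{20} \right)^2 \]

\[ \sigma_V = 140.5 \text{ cubic feet} \]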
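
The rules for combining standard errors can be sketched in Python. The sketch assumes the measurements are mutually independent and that products combine through their relative (proportional) errors, which reproduces the figures of the worked examples; the function names are illustrative.

```python
import math

def se_sum(*standard_errors):
    """Standard error of a sum of independent measurements."""
    return math.sqrt(sum(se**2 for se in standard_errors))

def relative_se_power(x, se_x, n):
    """Relative standard error of x raised to the power n."""
    return n * se_x / x

def relative_se_product(*relative_errors):
    """Relative standard error of a product of independent factors."""
    return math.sqrt(sum(r**2 for r in relative_errors))

# Distance measured in two sections (standard errors of 2 and 2.5 yards).
print(f"total distance: {se_sum(2.0, 2.5):.2f} yards")

# Cylindrical tank: V = pi * r^2 * h.
r, se_r = 10.0, 0.1
h, se_h = 20.0, 0.2
volume = math.pi * r**2 * h
se_volume = volume * relative_se_product(relative_se_power(r, se_r, 2), se_h / h)
print(f"volume: {volume:.2f} cubic feet, standard error: {se_volume:.1f} cubic feet")
```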



Significance of Coefficient of Correlation 

When a coefficient of correlation is computed it is necessary 
to determine whether or not the correlation indicated is a real asso- 
ciation between the two considered series, or whether the indi- 
cated relationship has arisen from the accidental selection of 
values in the samples. 

Although no real association exists between the two series 
constituting the sample it is possible to obtain a definite value 
for r when computed from a sample drawn from the universe. 
It has already been noted that a difference between the means of 
two samples may make its appearance even when both samples 
are drawn from the same population. In the same way the value 
for r may be due to sampling fluctuations.

If the coefficient of correlation is computed for each of a large 
number of samples of paired values, a frequency distribution of 
the resulting coefficients will be normal (if the true association 
is zero). Through application of the standard deviation (stand- 
ard error of the coefficient of correlation) it can be foretold that 
no value of r greater than three times its standard error will arise 
due to chance (99.7 times out of 100). If, therefore, the computed 
r is more than three times $\sigma_r$, 99.7 times out of 100 it is significant.
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To determine the error likely to arise due to sampling, the stand- 
ard error of the coefficient of correlation may be used in the same 
fashion as the standard error of the mean. 

Fifty chances out of 100 the difference between the observed
and the actual r will not be larger than .6745 $\sigma_r$ (the probable
error of r), and 99.7 chances out of 100 this difference will not be
larger than 3 $\sigma_r$.

The formula for the standard error of the coefficient of correlation
is:

\[ \sigma_r = \frac{1 - r^2}{\sqrt{N}} \]



However, when the coefficient of correlation for the underlying 
population approaches 100%, the sampling distribution cannot 
be normal or symmetrical, since the possibilities of extremes in 
one direction are limited by the maximum obtainable value for r 
of 1.00, while the range of possible values of r in the opposite 
direction is still great. Here the value of r may be converted 
for tests of reliability and significance into a more useful value of z. 

\[ z = \tfrac{1}{2} \left[ \log_e (1 + r) - \log_e (1 - r) \right] \]

The sampling distribution of this value approaches normality
and is symmetrical.¹

Its standard error is

\[ \sigma_z = \frac{1}{\sqrt{N - 3}} \]
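
The conversion of r to z, and the return from z to r, can be sketched as follows. The sketch takes the standard error of z as 1/sqrt(N - 3), the form given by Fisher; the sample values (r = .90 from 28 pairs) and the function names are hypothetical and serve only as illustration.

```python
import math

def fisher_z(r):
    """Convert a coefficient of correlation r to z."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

def se_z(n):
    """Standard error of z for a sample of n pairs (assumed form 1/sqrt(n - 3))."""
    return 1 / math.sqrt(n - 3)

# Hypothetical sample: r = .90 computed from 28 pairs of values.
r, n = 0.90, 28
z = fisher_z(r)

# Limits of three standard errors on the z scale, converted back to r.
z_low, z_high = z - 3 * se_z(n), z + 3 * se_z(n)
r_low, r_high = math.tanh(z_low), math.tanh(z_high)

print(f"z = {z:.3f}, standard error of z = {se_z(n):.3f}")
print(f"limits for r: {r_low:.3f} to {r_high:.3f}")
```

The limits for r are not symmetrical about .90, which reflects the skewness of the sampling distribution of r described above.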


Small Samples — Standard Error of Mean 

Due to serious errors which arise in this technique, where the 
number of items constituting the sample is small (generally less 
than 30) the standard errors outlined above can no longer be used. 

If the sample is small, a new standard error is computed:²

\[ s^2 = \frac{\sum (x^2)}{N - 1} = \frac{N \sigma^2}{N - 1} \]

\[ S_{\bar{x}} = \frac{s}{\sqrt{N}} \]

But, for small samples the usual values of the multiples of 
standard errors taken to include a given percent of the cases 
can no longer be applied. The multiples to be used for various 
percentages of probability of occurrence of deviation not greater 
than given size are shown below.³ (The N' in the table for the
standard error of the mean is N - 1.)

¹ For a more elaborate discussion of the standard error of z see Fisher, R. A., Statistical
Methods for Research Workers.

² If $\sigma$ is used, rather than s, this formula may be written $S_{\bar{x}} = \frac{\sigma}{\sqrt{N - 1}}$

³ This is generally referred to as the "t" table and the value in the body of the table
as "t."
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60 % 

 N'       50%        95%        99%

  1      1.000     12.706     63.657
  2       .816      4.303      9.925
  3       .765      3.182      5.841
  4       .741      2.776      4.604
  5       .727      2.571      4.032
  6       .718      2.447      3.707
  7       .711      2.365      3.499
  8       .706      2.306      3.355
  9       .703      2.262      3.250
 10       .700      2.228      3.169
 11       .697      2.201      3.106
 12       .695      2.179      3.055
 13       .694      2.160      3.012
 14       .692      2.145      2.977
 15       .691      2.131      2.947
 16       .690      2.120      2.921
 17       .689      2.110      2.898
 18       .688      2.101      2.878
 19       .688      2.093      2.861
 20       .687      2.086      2.845
 21       .686      2.080      2.831
 22       .686      2.074      2.819
 23       .685      2.069      2.807
 24       .685      2.064      2.797
 25       .684      2.060      2.787
 26       .684      2.056      2.779
 27       .684      2.052      2.771
 28       .683      2.048      2.763
 29       .683      2.045      2.756
 30       .683      2.042      2.750


Other Standard Errors for Small Samples:

The standard errors for various other statistical measures when
computed for small samples are shown below. The results obtained
from these formulae may be applied in the same manner
as the standard errors for large samples (see pp. 113 to 122), using
the appropriate multiples of the standard error obtained from
the table above.


Measure               Standard Error¹                                              Value of N' in
                      (For small samples)                                          table of multiples

Difference            $s = \sqrt{\frac{\sum (x_1^2) + \sum (x_2^2)}{(N_1 - 1) + (N_2 - 1)}}$      $N' = N_1 + N_2 - 2$
between²
two means             $S_D = s \sqrt{\frac{N_1 + N_2}{N_1 N_2}}$

Coefficient of³       $S_r = \sqrt{\frac{1 - r^2}{N - 2}}$                         $N' = N - 2$
Correlation

Coefficient of        $\sigma_z = \frac{1}{\sqrt{N - 3}}$                          Use multiples
Correlation                                                                        as for large
in terms of "z"                                                                    samples.


¹ Fisher, R. A., Statistical Methods for Research Workers.

² $x_1$ is the deviation of the actual values from the mean of all $X_1$ values.

³ The method involving transposition of r to z is generally more satisfactory than the one
given here.





ADDITIONAL BIBLIOGRAPHY^ 

Bowley, Arthur L., Elements of Statistics, pp. 178-195; 312-342. 
P. S. King & Son, London, 1907. 

Fisher, R. A., Statistical Methods for Research Workers, pp. 53-64; 
70-78; 10&-119; 123-133. Oliver & Boyd, Edinburgh 1932 

Kelley, Truman L., Interpretation of Educational Measurements, 
pp. 54-61; 156-158; 171-178; 188. World Book Co., Yonkers, 
New York, & Chicago, Illinois, 1927. 

Odell, C. W., Educational Statistics, pp. 221-241. Century Co,, 
New York, 1925. 

Otis, Arthur S,, Statistical Method in Educational Measurements, 
pp. 247-266. World Book Co., Yonkers, New York, & 
Chicago, Illinois, 1926. 

Rietz, H. L. (Editor), Handbook of Mathematical Statistics,
pp. 71-77. Houghton Mifflin Co., New York, 1924. 

Ruch, G. M., & Stoddard, George R., Tests and Measurements
in High School Instruction, pp. 363-374. World Book Co., 
Yonkers, New York, & Chicago, Illinois, 1927. 

Traube, Marion R., Measuring Results in Education, pp. 456-465. 
American Book Co., New York, 1924. 


¹ For readings in standard Statistics textbooks, see the QUICK REFERENCE TABLE TO
STANDARD TEXTBOOKS following Table of Contents.



CHAPTER XIV 


INDEX NUMBERS 


Definition 

The index number is a statistical device for measuring changes 
in groups of data. 

The method may be applied to many general conditions such as 
employment, prices, group health, academic grades, etc. Data 
descriptive of these general conditions fluctuate widely; but such 
data exhibit, nevertheless, definite and measurable general
tendencies.

In order to measure the changes in the large number of constantly
varying items in the data it is necessary to resort to some
relative¹ averaging device that will serve as a yardstick of comparative
measurement. The index number is such a device.

The index number measures fluctuations during intervals of 
time, group differences of geographical position or degree, etc. 
Thus it is possible to obtain an index number showing the rela- 
tive sales possibilities for a given product in different territories; 
the academic standing of a group of college students as compared 
with other groups of students; or to ascertain the relative credit
position of a single corporation as compared with many others 
in the same industry. 

For purposes of explanation the discussion below is largely confined
to index numbers of prices.

Index Number Construction Problems 

1. The purpose for which the index is used has a definite bearing
upon the choice of the data used, the method followed, etc.

2. Careful selection of the number and types of items to be used
is necessary so that the index fluctuations will be truly representative
of the fluctuations in the series.

3. After determination of the proper method of data collection
it is necessary to find the available sources of the data needed.
Then follows the necessity of actually collecting the data.

4. The problems of selecting the base period and the best 
method of computation must be solved. 

5. The degree of relative importance of each constituent item 
to the purpose of the index must be determined. This designa- 
tion of the relative importance of each item is known as weighting. 

¹ Index numbers are not always relative (percentage) in form. Occasionally they are
expressed in terms of absolute (actual) values.
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Number and Kinds of Commodities 

Index numbers must be constructed from samples or limited 
portions of the types of prices considered; therefore a number of 
rules should be observed in selecting the commodities : 

1. The sample used should be representative (see p. 113). The 
items selected should be chosen for their representative 
quality rather than because of the ease of securing quota- 
tions. 

2. A sufficient number of items should be used. Dr. Irving
Fisher points out that index numbers of prices are seldom of
much value "unless they consist of more than 20 commodities
and 50 is a much better number."¹ He also shows that
"after 50 the improvement obtained from increasing the
number of commodities is gradual and it is doubtful if the
gain from increasing the number beyond 200 is ordinarily
worth the extra trouble and expense."


The Base Period

The base period is assigned the value of 100% and is thereby
arbitrarily established as a reference period. Index numbers are
then computed as relative to the base period.

In the selection of the base period the following should be con- 
sidered : 

1. The base period should not be too far in the past, in
order that a comparison of the price level relative to the
base period will be of definite present or comparative value.

2. Comparison is generally made to a "normal" period; therefore
the base period should not be extreme.

Shifting the Base 

For comparative purposes the base of an index number series 
is sometimes shifted from one period to another. The shift may 
be accomplished by dividing each number in the series by the
index number for the period to be used as the new base
and multiplying the result by 100.²

In the following illustration the base year of the index series, 
1926, is shifted to 1928 by dividing each index number by the 
index value for 1928 (150.0) and multiplying by 100. 

1926 1927 1928 1929 

100.0 110.1 150.0 125.3 

66.7 73.4 100.0 83.5 
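
The short method of shifting the base amounts to one division per period. A minimal Python sketch, using the figures of the illustration (the function name is illustrative only):

```python
def shift_base(series, new_base):
    """Re-express an index series with the period `new_base` equal to 100."""
    base_value = series[new_base]
    return {period: round(value / base_value * 100, 1)
            for period, value in series.items()}

# Index series from the illustration, with 1926 as the base year.
index_1926_base = {1926: 100.0, 1927: 110.1, 1928: 150.0, 1929: 125.3}

index_1928_base = shift_base(index_1926_base, 1928)
for year, value in index_1928_base.items():
    print(year, value)
```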

¹ Fisher, Irving, The Making of Index Numbers, p. 340.

² Not all types of index numbers can have their base shifted in this manner (see discussion
below for types which can be handled in this manner). In order to use a new base period certain
types must be completely reconstructed.





Selection of Method of Computation

Irving Fisher gives over 150 different formulae for the construc- 
tion of index numbers.^ These formulae, however, are largely 
variations of a limited number of main types. 

Some of the major groups of methods of constructing index 
numbers may be classified as: 

1. The Unweighted (Simple) Method: 

a. The aggregate of actual prices. 

b. The average of relative prices. 

2. The Weighted Method:

a. The weighted aggregate of actual prices. 

b. The weighted average of relative prices. 

Simple Aggregate of Actual Prices 

The index number constructed by the simple aggregate method
is a comparison of the sum of the prices for the commodities
considered to the sum of the prices for the same commodities in the
base period.²

\[ \frac{\sum p_n}{\sum p_0} \]  (Index number formula No. 1)

where

$\sum p_n$ = sum of prices of commodities of any given period.

$\sum p_0$ = sum of prices of commodities in base period.


Table 33 — Computation of Index of Wholesale Metal Prices by Unweighted
Aggregate of Actuals Method

(1926 Used as Base Year)

                          Prices in Dollars
Metal        Unit        1926         1928         1930

Pig Iron     Ton       $20.4200     $17.6800     $17.1700
Copper       Pound        .1393        .1468        .1311
Aluminum     Pound        .2699        .2390        .2339
Lead         Pound        .0825        .0614        .0538
Zinc         Pound        .0737        .0603        .0456
Tin          Pound        .6536        .5039        .3163
Silver       Ounce        .6211        .5818        .3815

Total                  $22.2601     $19.2732     $18.3322
Index                    100.0%        86.6%        82.4%
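
The computation in the table above can be sketched in Python. The prices are those of the table; the function and variable names are illustrative only.

```python
# Prices in dollars for 1926, 1928 and 1930, in that order.
prices = {
    "Pig Iron": (20.4200, 17.6800, 17.1700),
    "Copper":   (0.1393, 0.1468, 0.1311),
    "Aluminum": (0.2699, 0.2390, 0.2339),
    "Lead":     (0.0825, 0.0614, 0.0538),
    "Zinc":     (0.0737, 0.0603, 0.0456),
    "Tin":      (0.6536, 0.5039, 0.3163),
    "Silver":   (0.6211, 0.5818, 0.3815),
}

def simple_aggregate_index(prices, period, base=0):
    """Sum of prices in the given period as a percentage of the
    sum of prices in the base period."""
    total_given = sum(p[period] for p in prices.values())
    total_base = sum(p[base] for p in prices.values())
    return 100 * total_given / total_base

index_1928 = simple_aggregate_index(prices, 1)
index_1930 = simple_aggregate_index(prices, 2)
print(f"1928: {index_1928:.1f}%   1930: {index_1930:.1f}%")
```

Because pig iron accounts for nearly all of each total, the result is almost the price relative of pig iron alone.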


However, the index number computed by the simple aggregate 
method is subject to a serious defect in that those commodities 
which have large figure quotations will dominate the index. For 

¹ Fisher, Irving, The Making of Index Numbers.

² The index may be expressed as a sum of money rather than a relative or percentage figure,
e.g., the Bradstreet index of wholesale prices for January, 1931 was $9.51.
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instance, if there should be a decrease of 10% in the price of pig 
iron while all other commodities rose 10%, thus indicating an 
increase in the price level, nevertheless the predominating in- 
fluence of the pig iron quotation will cause the index to fall. 



                          Number of
                          Commodities      1926         19—

Pig Iron                       1          $20.42      $18.378      Decrease of 10%
All other commodities          6            1.8401      2.0241     Increase of 10%

Total                          7          $22.2601    $20.4021
Index                                       100%         91.7%



The difficulty indicated above cannot be avoided by reducing 
all commodities to a common unit such as a pound — as done in 
the Bradstreet index. Such procedure would only give rise to
new inequalities. Applied to the problem in table 33 pig iron 
would be $.0102 a pound while tin would be $.6536 a pound (1926). 

Average of Relative Prices

A method which avoids the price inequalities shown above 
involves the conversion of the price figures into relatives. A 
price relative is a statement of the price of a commodity as a per- 
cent of its price in the base period. Expressed in formula form: 

\[ \frac{p_n}{p_0} \]

where

$p_n$ = price in the given period.

$p_0$ = price in the base period.

The relative for each commodity in the base period is 100%. 
The relative for the period under consideration is averaged to 


Table 34 — Computation of Index of Wholesale Metal Prices by Unweighted
Arithmetic "Mean of Relatives" Method

(1926 Used as Base Year)

                            1926                    1928                    1930
Metal       Unit     Price in   Relative     Price in   Relative     Price in   Relative
                     Dollars    Price        Dollars    Price        Dollars    Price

Pig Iron    Ton     $20.4200     100%       $17.6800     86.6%      $17.1700     84.1%
Copper      Lb.        .1393     100           .1468    105.4          .1311     94.1
Aluminum    Lb.        .2699     100           .2390     88.6          .2339     86.7
Lead        Lb.        .0825     100           .0614     74.4          .0538     65.2
Zinc        Lb.        .0737     100           .0603     81.8          .0456     61.9
Tin         Lb.        .6536     100           .5039     77.1          .3163     48.4
Silver      Oz.        .6211     100           .5818     93.7          .3815     61.4

Totals                           700%                   607.6%                  501.8%
Index                            100%                    86.8%                   71.7%
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obtain the index number. The arithmetic mean, the median 
or the geometric mean may be used for averaging. 

The formula below demonstrates the computation of an index 
number by the average of relative prices method, using the arith- 
metic mean. 


Formula:

\[ \frac{\sum \left( \frac{p_n}{p_0} \right)}{N} \]  (Index number formula No. 2)

where N = number of commodities.

Advantages and Disadvantages of Various Averages In Index 
Number Construction 


Arithmetic Mean 

Advantages

1. The mean is relatively easy to compute.

2. Due to long and common usage the arithmetic mean is
commonly understood.

3. If a weighted average is taken the means of subgroups can
be averaged to obtain the mean of all values. (A weighted average
may be necessary if there are a varying number of items in the
various groups.)

Disadvantages

1. The mean is greatly affected by extremes (compare p. 21).

2. Increases are given a greater emphasis than decreases.
For instance, if commodity A should rise from $1 to $2, an
increase of 100 percent, while commodity B fell from $2 to $1, a
decrease of 50 percent, an index of the price level of these two
commodities (if the arithmetic mean of relatives is used) will
show an increased instead of an unchanged price level.


                      1926                   1929
Commodity       Price   Relative       Price   Relative

A                $1       100%          $2       200%
B                $2       100           $1        50

Total                                            250%
Index                     100%                   125%


3. The base of the index number computed by the average of 
relatives method cannot be shifted by the short method. 

Median 

Advantages

1. Unlike the arithmetic mean the median will not over- 
emphasize increases. 

2. The median is less affected by extremes than the arithmetic 
mean (see p. 21). 

3. It is easy to compute. The relatives are arranged accord- 
ing to size and the middle one is selected as the median. 
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Disadvantages 

1. The median cannot be treated algebraically; i.e. the medians 
of subgroups cannot be averaged to obtain the median of all the 
data. 

2. Its value is erratic when the number of items is small. 

3. The index constructed by this method cannot be shifted to 
a new base by the short method. 

Geometric Mean^ 

Advantages

1. The geometric mean does not overemphasize increases; 
rather, it gives equal importance to equal ratios of change. 

In the problem illustrating the disadvantage of the arithmetic
mean (p. 133) the correct index number can be secured by use of
the geometric mean.

                          1926                   1928
Commodity           Price   Relative       Price   Relative

A                    $1       100%          $2       200%
B                    $2       100           $1        50

Geometric Mean                100%                   100%
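
The contrast between the two averages can be verified with a short Python sketch using the two relatives of the example (200% and 50%). The function names are illustrative only.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """n-th root of the product of n positive values, computed with logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Relatives for commodities A and B in the given year (base year = 100).
relatives = [200.0, 50.0]

print(f"arithmetic mean of relatives: {arithmetic_mean(relatives):.1f}%")
print(f"geometric mean of relatives:  {geometric_mean(relatives):.1f}%")
```

The doubling of one price and the halving of the other are equal ratios of change in opposite directions, so only the geometric mean leaves the index unchanged.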


2. The base of an index number constructed by this technique 
can be shifted by the short method. 

Disadvantages 

1. The calculation of the geometric mean is laborious. 

2. It is an unfamiliar form of average. 

The Weighting of Index Numbers

It is often desirable to assign a varying degree of importance 
to the items composing the index numbers. If this action is 
not taken each commodity will be given a weight or importance 
depending upon the size of the price, or upon some other chance 
factor, rather than a proportionate weight depending upon its 
importance. 

An objection to the unweighted aggregate of actual prices 
method of constructing an index number may be eliminated by 
introducing a deliberate system of weights. To measure the 
weight or importance of the items composing a price index the 
quantity of each commodity produced may be used. 

The Weighted Average 

A weighted average (arithmetic mean) may be obtained: 

1. By multiplying each item by its corresponding weight.

2. By totaling the results obtained.

3. By dividing by the sum of the weights.

\[ \text{Weighted Average} = \frac{\sum (\text{Items} \times \text{Weights})}{\sum (\text{Weights})} \]

¹ For a more complete explanation of the geometric mean see p. 26.
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Thus if it is found that in a particular section there are two 
quotations on the price of bread, say 6¢ in chain stores selling
10,000 loaves and 8¢ in independent bakers selling 1,000 loaves,
a weighted average of the prices may be determined as follows:

                 Price     Quantity Sold     Price Times Quantity

Chain Store      $.06         10,000               $600
Bakery            .08          1,000                 80

Total                         11,000               $680

$680 ÷ 11,000 = $.062
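
The three steps of the weighted average can be written as a single function. A minimal sketch using the bread quotations (the function name is illustrative only):

```python
def weighted_average(items, weights):
    """Sum of items times weights, divided by the sum of the weights."""
    total = sum(item * weight for item, weight in zip(items, weights))
    return total / sum(weights)

prices = [0.06, 0.08]        # chain store, independent baker
loaves_sold = [10_000, 1_000]

average_price = weighted_average(prices, loaves_sold)
print(f"weighted average price: ${average_price:.3f}")
```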

Weighted Aggregate of Actual Prices 


A weighted aggregate^ of actual prices may now be computed 
by using as a weight the quantity of each commodity produced. 
The quantities produced in some fixed period, such as the base 
year, may be used as weights. The index is obtained by com- 
paring the weighted aggi*egate (total) for the given year to that 
for the base year. 

$$\frac{\sum (p_n q_0)}{\sum (p_0 q_0)} \qquad \text{(Index number formula No. 3a)}$$

When 

$p_n$ = Price, given year 
$p_0$ = Price, base year 
$q_0$ = Quantity, base year 
$q_n$ = Quantity, given year 

For 1928 

$$\frac{\sum (p_1 q_0)}{\sum (p_0 q_0)} = \frac{\$1,272,012.51}{\$1,446,076.73} = 88.0\%$$

For 1930 

$$\frac{\sum (p_2 q_0)}{\sum (p_0 q_0)} = \frac{\$1,149,875.80}{\$1,446,076.73} = 79.5\%$$
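
As a rough check on formula 3a, the computation can be sketched in Python. The function name `weighted_aggregate_index` is an assumption of this sketch; the prices and base year quantities are those read from the metals table of this chapter (pig iron, copper, aluminum, lead, zinc, tin, silver).

```python
def weighted_aggregate_index(given_prices, base_prices, quantities):
    """Formula 3a: sum(p_n * q_0) / sum(p_0 * q_0), with fixed quantity weights."""
    given_total = sum(p * q for p, q in zip(given_prices, quantities))
    base_total = sum(p * q for p, q in zip(base_prices, quantities))
    return given_total / base_total


# Order: pig iron, copper, aluminum, lead, zinc, tin, silver.
base_quantities = [39_373, 1_744_860, 145_000, 1_416_280, 1_236_800, 172_790, 62_719]
base_prices = [20.42, 0.1393, 0.2699, 0.0825, 0.0737, 0.6536, 0.6211]
prices_1928 = [17.68, 0.1468, 0.2390, 0.0614, 0.0603, 0.5039, 0.5818]
prices_1930 = [17.17, 0.1311, 0.2339, 0.0538, 0.0456, 0.3163, 0.3815]

index_1928 = weighted_aggregate_index(prices_1928, base_prices, base_quantities)
index_1930 = weighted_aggregate_index(prices_1930, base_prices, base_quantities)
print(f"1928: {index_1928:.1%}")  # 1928: 88.0%
print(f"1930: {index_1930:.1%}")  # 1930: 79.5%
```

Because the same quantities appear in numerator and denominator, the differing units (tons, pounds, ounces) cause no difficulty: every product is a dollar value.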


However, since conditions change, the quantity of the commodi- 
ties produced in any one fixed period will not be a good measure 
of their relative importance for all other periods. To meet this 
objection a set of weights which change every year may be used. 
Thus the quantity produced in each given year may be used as 
weights when constructing the index for that particular period. 
The formula will then read: 


$$\frac{\sum (p_n q_n)}{\sum (p_0 q_n)} \qquad \text{(Index number formula No. 3b)}$$

For 1928 

$$\frac{\sum (p_1 q_1)}{\sum (p_0 q_1)} = \frac{\$1,268,414.03}{\$1,438,339.20} = 88.2\%$$

For 1930 

$$\frac{\sum (p_2 q_2)}{\sum (p_0 q_2)} = \frac{\$962,303.20}{\$1,220,635.05} = 78.8\%$$

¹ The weighted average (arithmetic mean) is obtained by dividing the sum of the various 
items multiplied by their respective weights by the sum of the weights. The weighted aggregate 
or sum is obtained by securing the sum of the various items times their corresponding weights 
without dividing by the sum of the weights. 

                      Base Year                                      1928                          1930
Metal      Unit       Price in    Quantity     Price Times       Price in    Price Times       Price in    Price Times
                      Dollars                  Quantity          Dollars     Quantity          Dollars     Quantity
                      p0          q0           p0 q0             p1          p1 q0             p2          p2 q0
Pig Iron   Ton        $20.4200       39,373    $803,996.66       $17.6800    $696,114.64       $17.1700    $676,034.41
Copper     Pound         .1393    1,744,860     243,059.00          .1468     256,145.45          .1311     228,751.15
Aluminum   Pound         .2699      145,000      39,135.50          .2390      34,655.00          .2339      33,915.50
Lead       Pound         .0825    1,416,280     116,843.10          .0614      86,959.59          .0538      76,195.86
Zinc       Pound         .0737    1,236,800      91,152.16          .0603      74,579.04          .0456      56,398.08
Tin*       Pound         .6536      172,790     112,935.54          .5039      87,068.88          .3163      54,653.48
Silver     Ounce         .6211       62,719      38,954.77          .5818      36,489.91          .3815      23,927.30
Total                                        $1,446,076.73                  $1,272,012.51                 $1,149,875.78

* Imports. 

Table 36 — Computation of Index of Wholesale Prices of Metals in the United States by Weighted Aggregate of Actual Method 
Using Given Year Weights 

* Imports   ** United States Production 

Table 37 — Computation of Index of Wholesale Prices of Metals in the United States by Weighted Average of Relative Methods 

Weighted Average of Relatives 

An index number may be constructed by securing a weighted 
average of the relative prices for the period under consideration. 
Quantities of production, however, can no longer be used as weights 
since each quantity is expressed in different units (tons, pounds, 
ounces, bushels, etc.). The column of figures resulting from the 
multiplication of the price relatives by these weights would be 
expressed in different units and could not be totaled. It becomes 
necessary to use weights expressed in common units. The most 
usual common unit is the dollar. The money value rather than 
the quantity of production may then be used as weights. 

With base period weights and using a weighted arithmetic 
mean the formula will be: 

$$\frac{\sum \left( \dfrac{p_n}{p_0} \times p_0 q_0 \right)}{\sum (p_0 q_0)} \qquad \text{(Index number formula No. 4a)}$$

for $p_0 q_0$ equals value of production in the base period (the price 
times the quantity). 

Through cancellation the formula reduces to: 

$$\frac{\sum (p_n q_0)}{\sum (p_0 q_0)}$$

or the same as the weighted aggregate using base year weights 
(formula 3a). 
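
The cancellation can also be checked numerically. The sketch below is an illustration only: the variable names are assumptions, and the figures are the base year quantities and the base year and 1928 prices read from the metals table of this chapter.

```python
import math

# Order: pig iron, copper, aluminum, lead, zinc, tin, silver.
base_quantities = [39_373, 1_744_860, 145_000, 1_416_280, 1_236_800, 172_790, 62_719]
base_prices = [20.42, 0.1393, 0.2699, 0.0825, 0.0737, 0.6536, 0.6211]
prices_1928 = [17.68, 0.1468, 0.2390, 0.0614, 0.0603, 0.5039, 0.5818]

# Formula 4a: price relatives weighted by base period values (p_0 * q_0).
relatives = [pn / p0 for pn, p0 in zip(prices_1928, base_prices)]
base_values = [p0 * q0 for p0, q0 in zip(base_prices, base_quantities)]
average_of_relatives = sum(r * v for r, v in zip(relatives, base_values)) / sum(base_values)

# Formula 3a: weighted aggregate with base year quantities.
aggregate = sum(pn * q0 for pn, q0 in zip(prices_1928, base_quantities)) / sum(base_values)

print(math.isclose(average_of_relatives, aggregate))  # True
print(f"{average_of_relatives:.1%}")                   # 88.0%
```

The agreement holds for any set of prices and quantities, since each term (p_n / p_0) × p_0 q_0 is simply p_n q_0.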

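If given year weights are used a new formula is evolved. 

$$\frac{\sum \left( \dfrac{p_n}{p_0} \times p_n q_n \right)}{\sum (p_n q_n)} \qquad \text{(Index number formula No. 4b)}$$

For 1928 

$$\frac{\sum \left( \dfrac{p_1}{p_0} \times p_1 q_1 \right)}{\sum (p_1 q_1)} = \frac{\$1,145,003.52}{\$1,284,122.63} = 89.17\%$$

For 1930 

$$\frac{\sum \left( \dfrac{p_2}{p_0} \times p_2 q_2 \right)}{\sum (p_2 q_2)} = \frac{\$782,483.62}{\$963,677.20} = 81.20\%$$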


The Ideal Index Number 

Irving Fisher has developed an index number which meets the 
requirements of certain tests (see page 140) which can be applied 
to index numbers. His formula is a “cross” or geometric average 
of two formulae which are subject to opposite error. It is the 
geometric average of the aggregate of actuals weighted by base 
year quantities (formula 3a) and the aggregate with given year 
weights (formula 3b). The formula is: 

$$\sqrt{\text{Formula 3a} \times \text{Formula 3b}} = \sqrt{\frac{\sum (p_n q_0)}{\sum (p_0 q_0)} \times \frac{\sum (p_n q_n)}{\sum (p_0 q_n)}} \qquad \text{(Index number formula No. 5)}$$
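
A minimal sketch of formula 5 in Python follows. The function name `ideal_index` is an assumption of this sketch; the dollar totals are the aggregates quoted earlier in this chapter for formulas 3a and 3b.

```python
import math


def ideal_index(formula_3a, formula_3b):
    """Formula 5: geometric average of the base-weighted and given-weighted aggregates."""
    return math.sqrt(formula_3a * formula_3b)


# Aggregates quoted in the text (dollars).
index_3a_1928 = 1_272_012.51 / 1_446_076.73
index_3b_1928 = 1_268_414.03 / 1_438_339.20
index_3a_1930 = 1_149_875.80 / 1_446_076.73
index_3b_1930 = 962_303.20 / 1_220_635.05

ideal_1928 = ideal_index(index_3a_1928, index_3b_1928)
ideal_1930 = ideal_index(index_3a_1930, index_3b_1930)
print(f"1928: {ideal_1928:.1%}")  # 1928: 88.1%
print(f"1930: {ideal_1930:.1%}")  # 1930: 79.2%
```

Being a geometric average, the result always lies between the two component indexes.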

Index Number Tests 

Time Reversal Test 

An index number computed with a certain period as a base should 
not give inconsistent results when it is recomputed using the 
given period as the base. For example, if an index 
number with a base at 1926 (1926 = 1.00) should give rise to an 
index of 2.00 for 1928, then on reconstructing the index with a base of 
1928 the index for 1926 should be .50, the reciprocal of 2.00. 

             1926     1928 
Index A      1.00     2.00 
Index B       .50     1.00 

Cross multiplying the index numbers (2.00 × .50) 
should give a value of 1.00, since these numbers are reciprocal.¹ 
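
The test can be tried on a small example. The sketch below uses hypothetical prices and quantities for two commodities (all figures and function names are assumptions for illustration). It suggests that the aggregate weighted by base year quantities (formula 3a) does not in general meet the time reversal test, whereas the ideal index (formula 5) does.

```python
import math


def aggregate(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


def base_weighted(p0, q0, pn, qn):
    """Formula 3a: sum(p_n q_0) / sum(p_0 q_0)."""
    return aggregate(pn, q0) / aggregate(p0, q0)


def given_weighted(p0, q0, pn, qn):
    """Formula 3b: sum(p_n q_n) / sum(p_0 q_n)."""
    return aggregate(pn, qn) / aggregate(p0, qn)


def ideal(p0, q0, pn, qn):
    """Formula 5: geometric average of formulas 3a and 3b."""
    return math.sqrt(base_weighted(p0, q0, pn, qn) * given_weighted(p0, q0, pn, qn))


# Hypothetical data: two commodities in a base period and a given period.
p0, q0 = [1.0, 2.0], [10, 5]
pn, qn = [2.0, 3.0], [6, 8]

# Forward index times the index with the two periods interchanged.
product_3a = base_weighted(p0, q0, pn, qn) * base_weighted(pn, qn, p0, q0)
product_ideal = ideal(p0, q0, pn, qn) * ideal(pn, qn, p0, q0)

print(round(product_3a, 4))     # 1.0694, so the test is not met
print(round(product_ideal, 4))  # 1.0, so the test is met
```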


Factor Reversal Test 


The change in the price times the change in the quantity should 
be equal to the change in the value of the commodities produced. 

The index of prices can be obtained by any of the methods; 
for example, the aggregate of actuals weighted by base or fixed 
year weights (formula 3a) may be used for the purpose. 

$$\frac{\sum (p_n q_0)}{\sum (p_0 q_0)}$$

An index of the quantity of production can be secured by re- 
versing the positions of the price figures (p) with the quantity 
figures (q). 

$$\frac{\sum (q_n p_0)}{\sum (q_0 p_0)}$$

An index of the value of production can be obtained by com- 
paring the value of production in the given period ($V_n$) to the value 
of production in the base period ($V_0$). 

Therefore: 

$$\frac{\sum (p_n q_0)}{\sum (p_0 q_0)} \times \frac{\sum (q_n p_0)}{\sum (q_0 p_0)} \quad \text{should equal} \quad \frac{V_n}{V_0}$$

where $V_n$, the value of production in the given period, may be 
obtained by multiplying the price in the given period by the quantity 
produced in that period, or $\sum (p_n q_n)$, and for the base period 
$\sum (p_0 q_0)$. The test will then read: 

$$\frac{\sum (p_n q_0)}{\sum (p_0 q_0)} \times \frac{\sum (q_n p_0)}{\sum (q_0 p_0)} \quad \text{should equal} \quad \frac{\sum (p_n q_n)}{\sum (p_0 q_0)}$$

¹ A reciprocal of a number is 1 divided by that number. 

Table 38 lists some of the more important price index number 
series. 


Quantity Index Numbers 

Index number technique can be applied to measurement of 
changes in quantity groups as well as price changes. Index num- 
bers of this type are used to measure changes in business activity, 
industrial production, commodity stocks, etc. 

The methods of construction are the same for quantity index 
numbers as for price index numbers. The simplest form is the 
simple aggregate type 

$$\frac{\sum q_n}{\sum q_0}$$

where $\sum q_n$ is the sum of the quantities in any given period 
and $\sum q_0$ is the sum of the quantities in the base period. 

Since this form of index involves the sums of series the various 
quantities must all be in the same units (tons, bushels, etc.) to 
make the summation possible. 

When the units are different for the various items in the series 
and an unweighted index number is desired the average of relatives 
may be used. If the arithmetic mean is used as the average the 
formula is 

$$\frac{\sum \left( \dfrac{q_n}{q_0} \right)}{N}$$

It is generally desirable, however, to weight the index numbers 
in order to arbitrarily assign various degrees of importance to the 
several items composing the index number. Either the price of 
the commodity or some arbitrary weight may be used for this 
purpose. 

The weighted aggregate form for use in measuring quantity 
changes is 

$$\frac{\sum (q_n p_0)}{\sum (q_0 p_0)} \qquad \text{with base year weights}$$

$$\frac{\sum (q_n p_n)}{\sum (q_0 p_n)} \qquad \text{with given year weights}$$

Table 38 

Index | Base Period | Periodicity | Number of Commodities | Area | Compiling Agency | Where Published 

Wholesale 
Bureau of Labor Statistics All Commodity Index | 1926=100 | Monthly | 784 | U. S. | U. S. Bureau of Labor Statistics | Survey of Current Business 
Bradstreet's Commodity Price Index | | Monthly | 96 | U. S. | Dun and Bradstreet Inc. | Survey of Current Business 
Dun's Commodity Price Index | | Monthly | 300 | U. S. | Dun and Bradstreet Inc. | Dun and Bradstreet Monthly Review 
Dun and Bradstreet Price Index | | Monthly | 13 groups | U. S. | Dun and Bradstreet Inc. | Dun and Bradstreet Monthly Review 
World Prices Index | 1923-25=100 | Monthly | 9 | World | U. S. Department of Commerce | Survey of Current Business 

Retail 
Bureau of Labor Statistics Food Price Index | 1913=100 | Bi-Monthly | 4S | U. S. and Hawaii | U. S. Bureau of Labor Statistics | Survey of Current Business 
Fairchild's Retail Price Index | Dec. 1930=100 | Monthly | o groups | U. S. | Fairchild Publications | Survey of Current Business 

Cost of Living 
National Industrial Conference Board Cost of Living Index | 1923=100 | Monthly | 5 groups | U. S. | National Industrial Conference Board | Survey of Current Business 
U. S. Bureau of Labor Index Cost of Living | 1913=100 | Semi-Annually | 6 groups | U. S. | U. S. Bureau of Labor | Survey of Current Business 

Farm Prices 
Index of Prices Received by Farmer | Aug. 1909-July 1914=100 | Monthly | 8 groups | U. S. | U. S. Department of Agriculture | Crops and Markets 
Index of Prices Paid for Commodities Used | 1910-1914=100 | Monthly | 3 groups | U. S. | U. S. Department of Agriculture | Crops and Markets 

Purchasing Power of the Dollar, Based on 
Wholesale Prices | 1923-25=100 | Monthly | 784 | U. S. | U. S. Department of Labor | Survey of Current Business 
Retail Prices | 1923-25=100 | Monthly | 4 groups | U. S. | U. S. Department of Labor | Survey of Current Business 
Farm Prices | 1923-25=100 | Monthly | 5 groups | U. S. | Department of Agriculture | Survey of Current Business 
Cost of Living | 1923-25=100 | Monthly | 5 groups | U. S. | National Industrial Conference Board | Survey of Current Business 

Unless the units are the same for all items only the prices can 
be used as weights and not arbitrary weights since if the latter are 
used no summation will be possible. 

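The weighted average of relatives may be used where the units 
are different and it is desired to use arbitrary weights 

$$\frac{\sum \left( \dfrac{q_n}{q_0} \times wt \right)}{\sum wt}$$

where $wt$ = weight. 

The “Ideal Index” can also be converted to quantity form 

$$\sqrt{\frac{\sum (q_n p_0)}{\sum (q_0 p_0)} \times \frac{\sum (q_n p_n)}{\sum (q_0 p_n)}}$$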


The composition of some of the better known quantity index 
series as well as the system of weighting is shown in table 39. The 
various items constituting the index have been grouped in this 
table into major groupings to facilitate comparison. 


Table 39 — Composition of Selected Indexes of Business Activity 
and Industrial Production 

                                                        WEIGHTS 
                                   Indexes of Industrial Production     Indexes of Business Activity 
Component Series                   Standard Trade     Federal           New York Times     Business 
                                   and Securities     Reserve Board     Annalist           Week 
Iron and Steel Production               25                20.5               25               10 
Textile Production                      18                17.5               18 
Lumber Production                       10                 8.5                7 
Agricultural & Food Products             9                 9.5 
Automobile & Tire Production             8                 5.2               10 
Non Ferrous Metal Production             0                 5.2                5 
Paper & Printing Production              4                 9.0 
Leather and Shoe Production              4                 8.9                2 
Cement Production                        4                 1.1                8 
Coal Production                          8                 6.9                                 3 
Railroad Equipment Production            2                 0.6 
Electric Power Production                2                                   15               12 
Chemical Production                      1                 3.2 
Enameled Ware Shipments 
Stone, Clay & Glass Production           1                 2.2 
Rubber                                                     1.5 
Carloadings                                                                  20               17 
Debits to Individual Accounts                                                                 30 
Building Construction                                                                         20 
Commercial Loans                                                                               4 
Money in Circulation                                                                           4 

CHAPTER XV 


FURTHER ANALYSIS OF THE FREQUENCY DISTRIBUTION 

Moments 

A frequency distribution can be more accurately analyzed if 
certain constants or “moments” of the distribution are computed. 
Moments are used for computing measures which are descriptive 
of the distribution and for the determination of the appropriate 
curve to be used in smoothing the distribution mathematically 
(see p. 105). 

I. The first moment of a frequency distribution as measured 
about any arbitrary origin is:* 

$$\nu_1 = \frac{\sum (f d')}{N}$$

II. The second moment (about an arbitrary origin) is: 

$$\nu_2 = \frac{\sum f (d'^2)}{N}$$

III. The third moment (about an arbitrary origin) is: 

$$\nu_3 = \frac{\sum f (d'^3)}{N}$$

IV. The fourth moment (about an arbitrary origin) is: 

$$\nu_4 = \frac{\sum f (d'^4)}{N}$$

The most important moments are those which are measured 
with the mean as the origin:** 

$$\mu_1 = \frac{\sum (f x)}{N}, \quad \mu_2 = \frac{\sum (f x^2)}{N}, \quad \mu_3 = \frac{\sum (f x^3)}{N}, \quad \mu_4 = \frac{\sum (f x^4)}{N}$$

where $x$ represents the deviation of the actual value from the mean. 

* The symbol ν is the Greek small letter nu. 
** The symbol μ is the Greek small letter mu. 

The sum of the deviations about the mean is zero and, therefore, 
the first moment will equal zero. The other moments about the 
mean can be obtained readily from 

$$\mu_2 = \nu_2 - \nu_1^2$$

$$\mu_3 = \nu_3 - 3 \nu_1 \nu_2 + 2 \nu_1^3$$

$$\mu_4 = \nu_4 - 4 \nu_1 \nu_3 + 6 \nu_1^2 \nu_2 - 3 \nu_1^4$$


Sheppard's Corrections for Grouping¹ 

The computation of the moments from a frequency distribution 
involves the assumption that the values may be dealt with as 
though they were all located at the midpoint of the class interval. 
This assumption is subject to a certain error, allowance for which 
can be made by use of the corrections shown below: 

I. Corrected First Moment 

$$\mu_1' = 0$$

II. Corrected Second Moment 

$$\mu_2' = \mu_2 - 1/12$$

III. Corrected Third Moment 

$$\mu_3' = \mu_3$$

IV. Corrected Fourth Moment 

$$\mu_4' = \mu_4 - \tfrac{1}{2} \mu_2 + 7/240$$

For convenience, the moments calculated by the methods out- 
lined above are generally computed in terms of class intervals 
rather than in original units. To convert the moments back to 
the original units the following relationships are used: 

$\mu_2'$ (in original units) = $C^2 \mu_2'$ (in class interval units) 

$\mu_3'$ (in original units) = $C^3 \mu_3'$ (in class interval units) 

$\mu_4'$ (in original units) = $C^4 \mu_4'$ (in class interval units) 

where $C$ = size of class interval groupings. 

The computation of the moments is illustrated on the following 
page. 


¹ The corrections apply only when (a) the distribution is continuous (see p. 7), and when 
(b) the distribution tapers off gradually in both directions. 

Table 40 — Calculation of Moments 

Variation of Thickness in 600 Brass Washers Manufactured by the ABC Co. 

(Hypothetical Data*) 

                                 Deviation From 
Thickness        Number of      Arbitrary Origin 
(In Inches)      Washers        in Class Intervals 
                    f                 d'              f(d')     f(d'²)     f(d'³)     f(d'⁴) 
    (1)            (2)               (3)               (4)        (5)        (6)        (7) 
.0180-.01839         6               -5                - 30       150      - 750       3750 
.0184-.01879        30               -4               - 120       480     - 1920       7680 
.0188-.01919        42               -3               - 126       378     - 1134       3402 
.0192-.01959        66               -2               - 132       264      - 528       1056 
.0196-.01999        94               -1                - 94        94       - 94         94 
.0200-.02039       120                0                   0         0          0          0 
.0204-.02079       102                1                 102       102        102        102 
.0208-.02119        60                2                 120       240        480        960 
.0212-.02159        54                3                 162       486       1458       4374 
.0216-.02199        14                4                  56       224        896       3584 
.0220-.02239        12                5                  60       300       1500       7500 
                   600                                  - 2      2718         10      32502 

* Hypothetical data based on a smaller distribution given by W. A. Shewhart, Economic 
Control of Quality of Manufactured Product. 


$$\nu_1 = \frac{\sum (f d')}{N} = \frac{-2}{600} = -.0033$$

$$\nu_2 = \frac{\sum f (d'^2)}{N} = \frac{2718}{600} = 4.5300$$

$$\nu_3 = \frac{\sum f (d'^3)}{N} = \frac{10}{600} = .0167$$

$$\nu_4 = \frac{\sum f (d'^4)}{N} = \frac{32502}{600} = 54.1700$$

$$\mu_1 = 0$$

$$\mu_2 = \nu_2 - \nu_1^2 = 4.5300 - (-.0033)^2 = 4.52999$$

$$\mu_3 = \nu_3 - 3 \nu_1 \nu_2 + 2 \nu_1^3 = .0167 - 3(-.0033)(4.5300) + 2(-.0033)^3 = .06155$$

$$\mu_4 = \nu_4 - 4 \nu_1 \nu_3 + 6 \nu_1^2 \nu_2 - 3 \nu_1^4 = 54.1700 - 4(-.0033)(.0167) + 6(-.0033)^2 (4.5300) - 3(-.0033)^4 = 54.170516$$

$$\mu_1' = 0$$

$$\mu_2' = \mu_2 - 1/12 = 4.52999 - .08333 = 4.44666$$

$$\mu_3' = \mu_3 = .06155$$

$$\mu_4' = \mu_4 - \tfrac{1}{2} \mu_2 + 7/240 = 54.170516 - 2.264995 + .029167 = 51.93469$$
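
The whole computation for Table 40 can be sketched in Python as a check on the hand work. The variable names are assumptions of this sketch; the frequencies and deviations are those of Table 40, and all results are in class interval units.

```python
# Frequencies f and deviations d' (in class intervals) from Table 40.
frequencies = [6, 30, 42, 66, 94, 120, 102, 60, 54, 14, 12]
deviations = list(range(-5, 6))  # -5, -4, ..., 5
n = sum(frequencies)             # 600 washers

# Column totals: sum of f * d'^k for k = 1, 2, 3, 4.
column_totals = [
    sum(f * d**k for f, d in zip(frequencies, deviations)) for k in (1, 2, 3, 4)
]
print(column_totals)  # [-2, 2718, 10, 32502]

# Moments about the arbitrary origin.
nu1, nu2, nu3, nu4 = (total / n for total in column_totals)

# Moments about the mean.
mu2 = nu2 - nu1**2
mu3 = nu3 - 3 * nu1 * nu2 + 2 * nu1**3
mu4 = nu4 - 4 * nu1 * nu3 + 6 * nu1**2 * nu2 - 3 * nu1**4

# Sheppard's corrections for grouping (class interval units).
mu2_corrected = mu2 - 1 / 12
mu3_corrected = mu3
mu4_corrected = mu4 - mu2 / 2 + 7 / 240

print(f"{mu2:.5f} {mu2_corrected:.5f}")  # 4.52999 4.44666
```

The sketch carries the unrounded values of ν through every step, so the later decimal places of the third and fourth moments may differ slightly from a hand computation that rounds each ν to four places.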




Curve Type Criteria 

The curve type that best describes the distribution may be 
identified from criteria calculated on the basis of the values of the 
moments. 

The criteria may be computed as follows:* 

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

$$\kappa = \frac{\beta_1 (\beta_2 + 3)^2}{4 (4 \beta_2 - 3 \beta_1)(2 \beta_2 - 3 \beta_1 - 6)}$$

By using these criteria the type of Pearson curve best describing 
the distribution may be identified (see Elderton, W. P., Frequency 
Curves and Correlation).¹ 


Kurtosis 

The kurtosis of a frequency distribution is its "peakedness." 
If the curve has a higher degree of kurtosis than the normal curve 
($\beta_2 > 3$) the curve may be said to be leptokurtic. If $\beta_2$ is less than 
3, the curve is more flat-topped than the normal curve and is said 
to be platykurtic. 

The measure of kurtosis is sometimes given as: 

$$\beta_2 - 3$$

Where the result is: 

1. zero the curve is mesokurtic. 

2. a positive value the curve is leptokurtic. 

3. a negative value the curve is platykurtic. 

The calculation of the $\beta_2$ value for the distribution of table 40 
is shown below: 

$$\beta_2 = \frac{\mu_4'}{(\mu_2')^2} = \frac{51.93469}{(4.44666)^2} = 2.63$$
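
The β₂ value and the resulting classification can be sketched in Python. The function name `kurtosis_type` is an assumption of this sketch; the frequencies are those of Table 40, and the moments are corrected for grouping as described earlier.

```python
def kurtosis_type(beta2):
    """Classify a distribution by comparing beta2 with 3, the value for the normal curve."""
    if beta2 > 3:
        return "leptokurtic"
    if beta2 == 3:
        return "mesokurtic"
    return "platykurtic"


# Frequencies f and deviations d' (in class intervals) from Table 40.
frequencies = [6, 30, 42, 66, 94, 120, 102, 60, 54, 14, 12]
deviations = list(range(-5, 6))
n = sum(frequencies)

nu1, nu2, nu3, nu4 = (
    sum(f * d**k for f, d in zip(frequencies, deviations)) / n for k in (1, 2, 3, 4)
)
mu2 = nu2 - nu1**2
mu4 = nu4 - 4 * nu1 * nu3 + 6 * nu1**2 * nu2 - 3 * nu1**4

# Sheppard's corrections, then beta2 = mu4' / (mu2')^2.
mu2_corrected = mu2 - 1 / 12
mu4_corrected = mu4 - mu2 / 2 + 7 / 240
beta2 = mu4_corrected / mu2_corrected**2

print(round(beta2, 2), kurtosis_type(beta2))  # 2.63 platykurtic
```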


Other Measures of Skewness 

A more exact determination of skewness can be computed 
from** 

$$\alpha_3 = \frac{\mu_3}{\sigma^3} = \sqrt{\beta_1}$$

The value of $\alpha_3$ will be zero for a normal curve. 

Another formula for skewness is*** 

$$\chi = \frac{\sqrt{\beta_1} \, (\beta_2 + 3)}{2 (5 \beta_2 - 6 \beta_1 - 9)}$$

This value (χ measure of skewness) can be used to locate the 
mode more accurately than the methods outlined previously. 

$$\text{Mode} = \bar{X} - (\chi)(\sigma)$$

¹ See p. 105, section on Generalization of Curves. 

* The symbol β is the Greek small letter beta and κ is the Greek small letter kappa. 

** The symbol α is the Greek small letter alpha. 

*** Suggested by Karl Pearson. χ is not to be confused with χ² used in testing goodness of fit. 
χ is the Greek small letter chi. 


ADDITIONAL BIBLIOGRAPHY* 

Elderton, W. P., Frequency Curves and Correlation. C. and E. 
Layton, London, 1929. 

Fisher, R. A., Statistical Methods for Research Workers, pp. 80-105, 
228-262. Oliver & Boyd, Edinburgh, 1932. 

Kelley, Truman L., Interpretation of Educational Measurements, 
p. 77. World Book Co., Yonkers, New York, & Chicago, 
Illinois, 1927. 

Rietz, H. L. (Editor), Handbook of Mathematical Statistics, 
pp. 97-111. Houghton Mifflin Co., New York, 1924. 

* For readings in standard Statistics textbooks see the QUICK REFERENCE TABLE TO 
STANDARD TEXTBOOKS following Table of Contents. 



CHAPTER XVI 
COLLECTION OF DATA 

Assembling and Collecting Data 

Data may be obtained from primary original sources, i.e., by 
interview, questionnaire or letter; or from secondary sources, 
i.e., data compiled by other individuals or agencies. 

Primary Sources 

Interview Method 

Advantages 

1. A higher degree of accuracy is attained through the acquisi- 
tion of material direct from the source. 

2. Material is often obtained that cannot be secured through 
the questionnaire. 

3. There is opportunity personally to check information ac- 
quired. 

Disadvantages 

1. Only small samples can be gathered. 

2. The subjective factor is involved in recording by interview. 

3. The method is generally inefficient, and the time and ex- 
pense involved necessarily mean limited field coverage. 

Questionnaire Method 

Characteristics 

1. The questions should be easily understood. 

2. If possible they should be arranged in logical sequence. 

3. The answers should consist of yes or no, check or blank space, 
or numerical indication where possible. 

4. The questionnaire should be concise. 

5. It should be in the most convenient, answerable form. 

6. It should be constructed so as to facilitate the tabulation 
of data. 

Advantages 

1. A large area may be easily and quickly covered. 

2. The method of assembling data is relatively inexpensive. 

Disadvantages 

1. Frequently questions cannot be answered without a sup- 
plementary explanation. 

2. In many cases the results are unreliable. 

3. A large part of the sample taken may not answer the ques- 
tionnaire. 

Secondary Sources 

Advantages 

1. The data are already compiled, thereby saving time and ex- 
pense. 

2. The responsibility for accuracy may be shifted. 

Disadvantages 

1. The data obtained by the primary agent cannot be verified. 

2. The statistical technique used may not be obtainable and 
therefore the accuracy of the results may not be verifiable. 

3. Subjective compiling and interpretation may have influ- 
enced the result shown. 

4. The purpose of the study may have prejudiced the choice 
of source material and technique adopted. 

5. A representative sample may not have been taken. 


CHAPTER XVII 


STATISTICAL TABLES 

Definition 

The statistical table is a systematic arrangement of numerical 
data presented in columns and rows for purpose of comparison. 

Statistical tables, classified according to purpose, are of two 
types, general purpose (primary) tables and special purpose 
(derived or text) tables. 

General Purpose Tables 

Functions 

1. The primary function of the general purpose table is to 
present original data in tabular form for reference purposes. 

2. It serves as a source of information where original data is 
needed. 

3. It is used in the construction of special purpose tables. 

Characteristics 

1. The general purpose table presents varied information on 
the same subject. 

2. It should contain absolute, not percentage, figures because 
of its purpose as outlined above. 

3. Information should be presented in such form that it can 
easily be used for reference. 

4. Actual figures, not round numbers, should be included. 

Special Purpose Tables 

Functions 

1. The primary function of the special purpose table is to 
present data so as to emphasize specific relationships. 

2. It is used to emphasize a particular phase of the general infor- 
mation contained in a general purpose table. 

3. It permits the presentation of selected materials in simple 
form. 

Characteristics 

1. Round numbers may be used at times. 

2. The selected material in a special purpose table is presented 
in a small space to facilitate interpretation. 

Table 41 

[TITLE]      Pig Iron Production and Prices in the United States, 
             1919-1930 

[BOXHEAD]    Year      Production              Prices*               [COLUMN CAPTIONS] 
                       (Thousands of           (Dollars per          [UNITS] 
                       Gross Tons)             Gross Ton) 

[STUB]       1919         31,016                 $28.97 
             1920         36,926                  42.76 
             1921         16,688                  22.68 
             1922         27,220                  24.06 
             1923         40,361                  26.30 
             1924         31,406                  20.90 
             1925         36,700                  20.58 
             1926         39,378                  20.42 
             1927         36,566                  18.55 
             1928         38,156                  17.68 
             1929         42,614                  18.43 
             1930         31,399                  17.17 

[FOOTNOTE]   * Composite of weekly average prices 
             on foundry and basic pig iron at Valley 
             furnace, Chicago, Birmingham. 

[SOURCE]     Source: Iron Age. 

Rules for Table Construction 

Practice varies,¹ but the generally accepted rules for the con- 
struction of a statistical table are as follows: 

1. Title — the title should be self explanatory and should indicate 
in the following order: 

a. the nature of the data presented 

b. the locality covered 

c. the time period included. 

The title is placed above the table. The lettering is usually 
larger in the title than in any other section. 

2. Source: The source of the material should always be in- 
dicated on a table, except where original data has been obtained, 
since it is used : 

a. to indicate the authority for the data 

b. as a means of verification 

c. as a reference for additional data. 

The source is placed below the table at the left. 

3. Footnotes: A footnote is used to further explain a figure in 
the table, etc. It is placed immediately beneath the table, above 
the source. The footnote should be indicated by symbols such as *, #, 
etc., or by a letter of the alphabet, never by a number, since the 
latter might be interpreted as part of the table. 

¹ Exceptions to the generally accepted procedure of table construction are usually justified 
by the particular purpose of a specific table. 

4. Arrangement of data: Items in a table if arranged carefully 
facilitate reading of the table, analysis and comparison of data 
and permit emphasizing of selected groups of data. Items may be 
arranged: 

a. alphabetically — according to the alphabetic order of the 
items. This is the most frequently used arrangement for 
general purpose tables. 

b. chronologically — according to time of occurrence in com- 
paring subjects over a period of time. Dates should move 
from the earliest to the latest date from the top of the stub 
to the bottom or in the boxhead from the left to the right 
of the table.¹ 

c. geographically — according to location in the customary 
classification for example: country, state, county, etc., or 
Maine, New Hampshire, Vermont, etc. This arrange- 
ment is generally confined to reference (general purpose) 
tables. 

d. by magnitude — according to size. The largest number 
is placed at the top of the column and the others arranged 
in order of size. The row captions correspond to their 
values. When the row captions are numerical, as class 
intervals in the frequency distribution, they are arranged 
by size. The smallest number is arranged at the top for 
the rows with the largest at the bottom; for the columns, 
the smallest is placed at the left to the largest at the right. 

e. by customary classification — There is a customary ar- 
rangement for many types of data which do not follow any 
serial arrangement. For instance the classification men, 
women, and children, is rarely listed in the order women, 
children, and men. 

5. Columns: When there are a number of columns in a table 
they may be numbered or lettered for reference purposes. 

6. Column Captions: The heading of each column is known 
as the column caption. It should be concise. A miscellaneous 
column is placed at the right end of the table. 

7. Stub: The heading of a row is known as the row caption. The 
section of the table containing row headings is designated as the 
stub. Items in the stub should be grouped, as months grouped by 
quarters, to facilitate interpretation of the data. 

8. Totals: The totals of columns should be placed at the bottom 
of the columns, while row totals should be placed at the extreme 
right.² 

¹ An exception to this rule occurs when the latest figures are of primary interest, as when the 
figures are published for the first time. In this instance the latest figure may be listed before the 
others, and then separated from them by a double or heavy line. 

² The United States Census Bureau places totals at column headings and on the extreme 
left. This practice is explained by the Department as due to the major interest in totals 
(see footnote 1 above). 

9. Units of Measurement: These should be included in the 
boxhead under the column captions. 

10. Rulings: Lines should be ruled on a table as follows: 

a. A horizontal line is placed below the title, and below the 
body of the table. 

b. Columns are separated by single lines. In typewritten 
columns these lines are not essential but are useful. 

c. The stub and boxhead are separated from the figures by 
double or heavy lines, especially in non-printed tables. 

d. Totals are separated from the other figures in a column 
by a single line. 

11. Emphasis: A double line, heavy line, italics and light and 
bold face type contrasts are all used for emphasis on tables. 


GRAPHIC PRESENTATION 


Definition 

A graph is a method of presenting statistical data in visual 
form. 

Types of Graphs 

There are many varieties of graphs. The use of a particular 
type is dependent upon the data and upon the purpose for which 
the graph is constructed. 

Graphs may be divided into the following types: 

I. Line or Curve Graphs 

a. Arithmetic ruling 

b. Semi-logarithmic or logarithmic ruling 

c. Other special rulings 

Special Types of Line Graphs 

a. Silhouette chart 

b. Band chart 

c. High Low Chart 

d. Histogram 

II. Bar charts 

III. Area Diagrams 

IV. Solid Diagrams 

V. Statistical Maps 

Rules for Construction of Graphs¹ 

1. Every graph must have a clear and concise title which is 
generally placed at the top center of the graph.² As a rule the 
title includes information as to: 

a. The nature of the data. 

b. The geographical location. 

c. The period covered. 

These elements of the title customarily appear in the order given 
above. 

¹ For a more complete discussion of the technique of graphic presentation the reader is 
referred to Arkin, H., and Colton, R., Graphs, Harper & Bros., 1935. 

² Graphs in printed form generally have the title placed below the graph; see graphs in this 
text. 

2. Coordinate lines should be held to a minimum and curve 
lines emphasized so that the curves stand out sharply against the 
background. 

3. The source of the data should be indicated just under the 
graph at the lower left. 

4. Footnotes, if any, should be placed under and to the right 
of the graph. 

5. If the graph is to be readily understood the curve lines, 
segments and other details should be as few in number as possible. 

6. Each scale must have a scale caption indicating the units 
used. 

a. The X axis scale caption should be centered directly 
beneath the X axis. 

b. The Y axis scale caption should be placed at the top of 
the Y axis. 

7. The zero point should be indicated on the scale (Y axis), 
otherwise a misleading comparison may result. The necessity of 
indicating the zero point is seen by comparison of the peaks at 
a and b in the two graphs below, figure 28.¹ 

Graph 1 (No zero line shown)          Graph 2 (Zero line indicated) 

Fig. 28 — Steel Ingot Production in the United States, 1926 — 1931. 

Inclusion of the zero point (Y axis) in graph 2 indicates an 
entirely different ratio in the heights of the points at 1928 and 1930 
(5 to 4) instead of the misleading comparison 2 to 1 in graph 1. 

¹ An exception to this rule occurs when the graph is in percentage form. In this case the 
100% line is emphasized. 

If, however, lack of space makes it inconvenient to use the 
zero point line a scale break may be inserted to indicate its omis- 
sion. Various types of scale breaks are shown in figure 29. 

9. The scales of values should be placed along the X and Y 
axis, thus giving a general indication of the size of the variations 
occurring in the graph. 

(It is unnecessary to indicate fine gradations on the scales 
of value since it is not intended that actual values be 
read off from the graph. Actual values can be obtained 
from the table of original data which usually accompanies 
the graph.) 




Fig. 29 — Types of Scale Breaks. 

10. If a space on the X axis is used to indicate time intervals, 
the point representing the value for each period should be plotted 
at the midpoint for the period. If desired, however, the periods 
may be made to coincide with, and the points may then be plotted 
on, given coordinate lines. 

11. On the Y axis the scale of values should run from zero 
(or from the smallest value) on the bottom of the graph to the 
highest value at the top. On the X axis the values should run 
from lowest on the left to highest on the right. 

The various elements composing the graph are shown in figure 
30. 

I. Line or Curve Graphs 

The line or curve graph is distinguished by the fact that the 
variations in the data are indicated by means of a line or curve 
(see figure 30). 

This type of graph is constructed by plotting points whose 
positions are determined by their respective values on the X and 
Y scales. The points are connected by straight lines. 
Line graphs may be classified according to the type of scale
ruling used:

a. Arithmetic rulings

b. Logarithmic rulings

c. Other rulings¹

Arithmetic Rulings 

Arithmetically ruled paper has equal distances between the
coordinate lines. Equal quantities will then have equal distances.
Thus the distance between 1 and 3 on the background ruling will
be the same as that between 8 and 10.

An arithmetic progression will plot as a straight line on arith- 
metic paper since there are constant differences between the suc- 
cessive values in this type of series. 

Since equal amounts are assigned equal distances, equal changes 
indicate identical absolute differences. 

The line or curve type of graph is the most commonly used 
form of graphic presentation. 


Logarithmic and Semi-logarithmic Rulings²

When it is desired to compare percentage rather than absolute
changes a somewhat different form of ruling is used.

It can be shown³ that where there is a constant percentage
change between two pairs of figures the differences between the
logarithms of the figures will be equal.

¹ Various other rulings are also available but are beyond the scope of this discussion.

² Semi-logarithmic paper is also known as ratio paper.

³ See any text on elementary mathematics.
Numbers       Logarithms

   2           0.30103
   4           0.60206       100% increase

Difference     0.30103


Numbers       Logarithms

   5           0.69897
  10           1.00000       100% increase

Difference     0.30103


Thus if the logarithms of the values rather than the original 
figures are plotted constant differences (rises or falls) will then 
equal constant percentage changes. 

Since, however, a great deal of time and effort is required to 
convert the original data into the form of logarithms, a more 
convenient procedure is to arrange the scale so that the logarithms 
may be plotted directly by reference to a special scale. 
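
The equality of logarithmic differences for equal percentage changes can be checked numerically. The following is a minimal sketch in Python; the function names and the third pair of figures (40 and 80) are illustrative assumptions and do not come from the text.

```python
import math

def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0

def log_difference(old, new):
    """Difference between the common (base 10) logarithms of new and old."""
    return math.log10(new) - math.log10(old)

# Each pair doubles, i.e. shows a 100% increase, at a different absolute level.
pairs = [(2, 4), (5, 10), (40, 80)]
for old, new in pairs:
    print(old, new, percent_change(old, new), round(log_difference(old, new), 5))
```

Every pair gives the same logarithmic difference although the absolute differences (2, 5 and 40) are far apart, which is why a logarithmic scale shows equal percentage changes as equal vertical distances.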


[Diagram: an arithmetic scale marked 0, 1, 2, 3 set beside a
logarithmic scale marked 1, 10, 100, 1000 at the same positions.]


The logarithms in the longer procedure may be plotted on the
arithmetic scale in the usual fashion. Thus, if it is desired to
plot the value 2 its logarithm is determined (0.30103) and this
value plotted. If, however, a scale is prepared in advance with
the position 0.30103 marked 2, then the data may be plotted
without previously determining the logarithms of the values.

The relation between a simple arithmetic scale and a scale 
corresponding to it but prepared for plotting logarithms is shown 
below: 

If a logarithmic ruling is used on both the X and Y axes the
paper is known as logarithmic; if it is used on only one axis the
paper is semi-logarithmic.

Since time is generally placed on the X axis an arithmetic ruling 
is used on this axis in semi-logarithmic paper while the logarithmic 
ruling is retained on the Y axis. 

Characteristics of Logarithmic Charts 

1. There is no zero or base line. 

2. Semi-logarithmic charts have an arithmetic scale on the
horizontal axis. Logarithmic charts are ruled logarithmically
on both scales.

3. When plotted on logarithmic paper a geometric progression 
forms a straight line, since the logarithms of a geometric pro- 
gression form an arithmetic progression. 



Source: Federal Reserve Board, Federal Reserve Bulletin.

Fig. 31a — Exchange Rates on the Franc and the Pound Sterling,
1925 — 1932. (Plotted on arithmetic paper.)

Fig. 31b — Exchange Rates on the Franc and the Pound Sterling,
1925 — 1932. (Plotted on semi-logarithmic paper.)


4. Equal rises or falls indicate equal percentage changes. 

5. Equal slopes on a logarithmic chart denote equal rates of 
change. 
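
Characteristic 3 above can be illustrated with a short sketch in Python. The series used (first term 3, common ratio 2) and the helper names are assumptions chosen only for illustration.

```python
import math

def geometric_progression(first, ratio, count):
    """Return `count` terms of a geometric progression."""
    return [first * ratio ** k for k in range(count)]

def successive_differences(values):
    """Differences between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

series = geometric_progression(3, 2, 6)      # 3, 6, 12, 24, 48, 96
logs = [math.log10(v) for v in series]
steps = successive_differences(logs)

# The logarithms rise by a constant amount (log 2), so the plotted
# points fall on a straight line when the vertical scale is logarithmic.
print([round(s, 5) for s in steps])
```

The constant step between the logarithms is what produces the constant slope on logarithmic paper, and a constant slope is the graphic form of a constant rate of change.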

Logarithmic Charts Are Used:

1. To compare proportional rates of change.

2. To show the relationship between two or more series
which differ widely in amount.

a. The unsatisfactory nature of arithmetic paper as com- 
pared with semi-logarithmic paper for this purpose can 
be seen from figure 31. 

Special Types of Line Graphs 

1. Silhouette charts are line graphs showing the positive and 
negative deviations from a zero or base line with the area between 
the zero or base line and the curve filled in (see figure 32). 







GRAPHIC PRESENTATION 




Silhouette charts are constructed by plotting points indicating 
the actual deviations from the base line. The points are then 
connected and the area between the curve and the base line 
filled in. 

2. Band charts are a form of line graph which show variations 
in the component parts as well as the total. 

The chart is prepared by first plotting the variation in the
largest component part of the total. This segment may then be
shaded in or cross hatched. The next component part is then
added to this first segment and the result plotted. This cumulative
process is then continued until all of the component parts
have been included. The variations in the top line will then
represent variations in the total while the variations in the width
of any segment will indicate the variations in that particular
component part. Figure 33 illustrates this type of graph.

Source: Standard Trade and Securities Service, Statistical Base Book.

Fig. 32 — Gold Movements from the United States, 1919 — 1933.
(Silhouette Chart).

3. High-low graphs are a form of line graph which present
not only the changes occurring over a period of time but the
fluctuations occurring within each period (as day, week, month,
etc.) as well, indicating the high and low values.

The high-low chart is constructed by first plotting the lowest
value for a period and then the highest value for the same period.
This procedure is continued until the end of the time covered by the
graph. The low point and the high point for each period are
connected by means of heavy lines. These lines, since they are
closely spaced, tend to take the appearance of an irregular band.

4. The histogram, also known as a rectangular frequency
polygon, is constructed from a frequency distribution in the
following manner. Rectangles are erected using as the width the size
of the class interval and as the height the frequency in each class
interval. See page 3.
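
The cumulative construction of the band chart (item 2 above) may be sketched in Python as follows. The component names and figures are hypothetical, and ordering the remaining components by size after the largest is an assumption; the text specifies only that the largest component is plotted first.

```python
def band_boundaries(components):
    """Return (name, upper boundary line) for each band, largest component first.

    components: list of (name, values) with one value per period.
    """
    ordered = sorted(components, key=lambda c: sum(c[1]), reverse=True)
    running = [0.0] * len(ordered[0][1])
    layers = []
    for name, values in ordered:
        # Each component is added to the total of those already plotted.
        running = [r + v for r, v in zip(running, values)]
        layers.append((name, running))
    return layers

# Hypothetical figures for three periods.
components = [
    ("Manufacturing", [4, 5, 6]),
    ("Trading", [10, 12, 9]),
    ("Miscellaneous", [1, 2, 1]),
]
layers = band_boundaries(components)
for name, line in layers:
    print(name, line)
```

The last boundary is the top line of the chart and equals the total for each period; the vertical distance between two successive boundaries is the width of the band for that component.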



Source: Standard Trade and Securities Service, Statistical Base Book.

Fig. 33 — Commercial Failures in the United States, by Types,
1924 — 1930. Band Chart.


Bar Charts 

Bar charts visually contrast quantities by a comparison of bars 
of varying length but uniform width. 

Bar charts may be subdivided into four types, namely: 

1. Absolute

a. Simple 

b. Subdivided 

2. Percent 

a. Simple 

b. Subdivided 

Simple Absolute Bar Charts 

Rectangular bars of the same width are erected from the same
base line to proportionate lengths based on absolute or actual
data. The bars may be set up either horizontally or vertically;
however, when the scale involves time the vertical type of bar is
indicated. See graph A, figure 34.

Subdivided Absolute Bar Charts 

The bars are subdivided according to the size of each component.
The components of each bar are arranged in similar
order with the largest subdivisions at the base. The figures may
vary to such an extent that the largest subdivision may not retain
its position; nevertheless the order of arrangement should
remain fixed. Subdivided charts are cumulative in that each
subdivision in plotting is added to the total of the subdivisions
below it (see graph B, figure 34).

Source: United States Department of Commerce.

Fig. 34 — Exports from the United States, 1920, 1925, and 1930. Shown
as various forms of bar charts. (Chart C is for Manufactured exports
as percent of total.) Components shown: Manufactured, Foodstuffs,
Crude Materials.

Simple Percentage Bar Charts

The bars are constructed in a similar fashion to the method 
used in simple absolute bar charts except that the lengths of the 
bars represent percentage values (see graph C, figure 34). 

Subdivided Percentage Bar Charts

Rectangular bars of the same width and the same length are
constructed on the same base. The length represents 100 per
cent. Each bar is divided into segments, the size of each segment
being dependent on the percent of the total figure which each
subdivision represents.¹

The subdivisions of each bar are arranged in the same order of 
presentation with the largest percentage at the base (see graph 
D, figure 34). 

A special type of subdivided percentage bar chart is that which 
makes use of a single bar. The single bar is used when interest is 
centered on the component parts of a single total. The entire 
length of the bar represents 100 percent and each segment is repre- 
sented in order of size from left to right. 
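
The division of a single 100 percent bar into segments can be sketched in Python as below. The three component values are hypothetical, and the function name is an assumption made for illustration.

```python
def percentage_segments(components):
    """Return (name, percent of total) pairs in order of size, largest first."""
    total = sum(components.values())
    ordered = sorted(components.items(), key=lambda item: item[1], reverse=True)
    return [(name, value * 100.0 / total) for name, value in ordered]

# Hypothetical component values of a single total.
segments = percentage_segments({"B": 30, "A": 50, "C": 20})
for name, percent in segments:
    print(name, percent)
```

The percentages always sum to 100, so the segments laid end to end fill the whole bar whatever the absolute size of the total.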

Pictorial Bar Charts 

Bar charts may be constructed in pictorial form. Pictures of 
different heights may be used for comparative purposes. Thus to 
represent the gold holdings of the United States Treasury at dif- 
ferent periods stacks of coins of varying heights corresponding to 
the values they represented may be used. 

Loss and Gain Bar Charts

This type of bar chart is constructed by having the bars extended 
from a zero line. If the bar chart is constructed horizontally the 
bars representing losses extend to the left, profits to the right. If 
the bar chart is constructed vertically the bars above the horizontal 
normal line represent profits, the bars below the normal line repre- 
sent losses. 

Area Diagrams 

The area diagram contrasts quantities by comparing figures with
varying areas. Area diagrams may take many forms, the simplest
making use of geometric figures (such as circles and squares).
Area diagrams are of two types. In the first type total areas of 
different sizes may be contrasted by varying the sizes of the 
figures. In the second type subdivisions of a single area may be 
compared. 

¹ In subdivided bar charts the number of subdivisions should be as few as possible.

The most usual type of area diagram is the pie chart (see
figure 35).

Pie Chart 

Definition 

A pie diagram is a chart of circular shape broken into subdi- 
visions. The size of the section indicates the proportion of each 
component part to the whole. 


Construction 

1. Let the circle equal 100 percent.

2. Each circle is divided into 360°.

3. Therefore each percent equals 360°/100, or 3.6°.
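
The construction steps above amount to multiplying each percentage by 3.6. A minimal sketch in Python follows; the class names and values are hypothetical and the function name is an assumption.

```python
DEGREES_PER_PERCENT = 360.0 / 100.0   # 3.6 degrees for each percent

def sector_angles(components):
    """Return (name, percent, degrees) for each sector, largest first."""
    total = sum(components.values())
    ordered = sorted(components.items(), key=lambda item: item[1], reverse=True)
    result = []
    for name, value in ordered:
        percent = value * 100.0 / total
        result.append((name, percent, percent * DEGREES_PER_PERCENT))
    return result

# Hypothetical values for four classes of a single total.
sectors = sector_angles({"W": 50, "X": 25, "Y": 15, "Z": 10})
for name, percent, degrees in sectors:
    print(name, round(percent, 1), round(degrees, 1))
```

The angles are laid off with a protractor in order of size; since the percentages total 100 the angles total 360°.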



Source: United States Department of Commerce.

Fig. 35 — Exports from the United States, by Economic Classes, 1930.

Characteristics

1. The sectors are generally arranged clockwise in order
of size.

2. A uniform arrangement of sectors must be made in com- 
paring charts. 
3. Wherever possible wording and percentages should be
placed horizontally on the sector.

4. If shading, cross hatching, colors, etc., are used in place of
wording on sectors, a legend should be constructed to indicate their
meaning.

5. The effectiveness of the pie chart is enhanced by cross 
hatching, colors, shades, etc. 

6. A pie chart should have a minimum of sectors. 

7. The pie chart is difficult to construct accurately. 

8. It is difficult to estimate visually with any degree of accuracy 
the proportionate size of the sectors of a pie chart where per- 
centages are not indicated. 

Solid Diagrams 

Solid diagrams consist of geometric forms (cubes, spheres,
cylinders, etc.) or irregular figures constructed to illustrate
comparisons of magnitudes through comparison of the volumes of the
figures (see figure 36). The volumes of the figures in a solid
diagram are compared and not the heights or lengths of the figures.

The solid diagram makes accurate comparisons difficult and 
for this reason it should not be used if some other method of 
illustration is possible. 

Map Graph 

Function: The map graph presents in pictorial form the facts
in a geographic distribution.¹

Construction: Map graphs are of five major types: 

1. Shaded 

2. Cross Hatched 

3. Dotted 

4. Colored 

5. Pin 

a. Tacks 

b. Pins 

c. Flags 

Shaded Maps 

The proportionate quantities for particular areas may be
indicated by using various degrees of shading ranging from solid
black to white.

Cross Hatched Maps 

Cross hatching may be used to indicate varying quantities by 
varying the proportions of black and white space. 

¹ A geographic distribution is known also as a "spatial" distribution.

Fig. 36 — Gold Stock of the United States and Great Britain, December
31, 1930. Shown in Solid Diagram and Bar Chart Form. (Bar chart scale
in billions of dollars.)

Dotted Maps 

Dots (circular areas) on maps are used primarily in two ways:

a. Dots of similar size are placed on a map, the primary
purpose of which is to indicate the density of the numbers,
etc., in an area by varying the number of these dots.

b. Dots of proportional sizes are placed on a map to indicate
the total number or sizes in an area.

c. Dots of a fixed size, each with the same assigned value, may
be varied in number to indicate the various quantities for
each area.

Care must be taken in denoting relative sizes since comparisons 
must be made by varying the area of the dot. 

Colored Maps are constructed by using 

a. Various colors to indicate variations in sizes, etc. 

A variety of colors should not be used to indicate relative 
values since an individual color is of no greater value than 
any other color to the observer. 

b. Various degrees of the same color to illustrate relative posi- 
tions of different areas. The difficulty with using a single 
color scheme is that there is a limited number of shades 
which can be satisfactorily used. 

Map-Tack System

Maps using tacks, flags, etc., are used for a large number of pur- 
poses in indicating relative sizes and also densities in a geographic 
area. 

The heads of the pins or tacks may be of various colors, sizes, 
and shapes and thus extend their flexibility for the indication of 
sizes, locations, routes, etc. 
CHAPTER XIX


SPECIAL TECHNIQUES IN EDUCATION, PSYCHOLOGY 
AND BIOLOGY 

Special Techniques in Education and Psychology

Standard Scores¹

In order to compare the results of two or more tests given to
a number of students the tests should be of equal difficulty. If
they are unequal in difficulty the average grades for each
examination may differ widely, as will the resulting dispersion of the grades.
The difficulty in the comparison of grades can be seen in the
following distribution of the results for two examinations:


Student        Examination     Examination
Number         Number 1        Number 2

   1              100             100
   2               99              85
   3               99              70
   4               98              68
   5               97              65
   6               96              60
   7               90              60
   8               90              60
   9               87              50
  10               80              45


The grades or scores may be standardized by converting each
grade into a deviation from its respective arithmetic mean and
dividing by the standard deviation to allow for differing average
attainment and dispersion of grade.

$$ z = \frac{x}{\sigma} $$

where

z = standard score
x = deviation of given score from mean
σ = standard deviation of original grades

The grades on both examinations will average to the same
value, zero (since Σ(x) = 0), and since each grade has been
made relative to its standard deviation² the standard scores on
both tests will have the same degree of dispersion.
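
The standardization can be carried out on the two examinations tabulated above. The sketch below uses Python's statistics module; taking σ as the population standard deviation (division by N rather than N − 1) is an assumption about the convention intended.

```python
from statistics import mean, pstdev

def standard_scores(grades):
    """Convert grades to standard scores z = x / sigma, x being the deviation from the mean."""
    average = mean(grades)
    sigma = pstdev(grades)   # population standard deviation (assumed convention)
    return [(grade - average) / sigma for grade in grades]

exam_1 = [100, 99, 99, 98, 97, 96, 90, 90, 87, 80]
exam_2 = [100, 85, 70, 68, 65, 60, 60, 60, 50, 45]

z_1 = standard_scores(exam_1)
z_2 = standard_scores(exam_2)

for student, (a, b) in enumerate(zip(z_1, z_2), start=1):
    print(student, round(a, 2), round(b, 2))
```

Although the raw averages differ widely (93.6 against 66.3), both sets of standard scores average zero and have a standard deviation of one, so a pupil's standing on the two examinations can be compared directly.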

¹ Compare Kelley, T. L., Statistical Methods.

² If it is desired to have a fixed average score of 50 in each test the formula z = 50 + x/σ
may be used.
If a pupil is consistent the standard scores attained on two tests
measuring different traits, such as silent reading and arithmetic,
should be equal.

A large divergence in the value of these standard scores indicates 
a definite idiosyncrasy of the pupil considered. 


The Coefficient of Reliability


The reliability of any test is measured by the similarity of results
attained when the same test is given a number of times. If the
reliability of the test or other measuring instrument is perfect,
exactly similar results (allowing for chance variation) should be
obtained when the test is given twice; that is, the coefficient of
correlation between the two sets of scores will be 1.00.

The coefficient of correlation of the scores secured from two 
applications of the same measuring instrument is thus a coefficient 
of reliability. 

However, it is usually not practical to give the same examination
or the same form of examination twice. In place of this it is
preferable to use the coefficient of correlation between the scores
attained by dividing the questions into two parts. The odd
numbered questions (1, 3, 5, 7, etc.) may be used as
one set and the even numbered questions (2, 4, 6, 8, etc.) as the
other. For this purpose the distribution of questions between the
two sets must be at random or at least the two sets of questions
must be of equal difficulty.

The increase in the reliability of a test secured by lengthening
it or repeating it a number of times in the same form may be
computed from the Spearman-Brown formula

$$ r_n = \frac{n\, r_{11}}{1 + (n - 1)\, r_{11}} $$

where

r_n is the increased reliability coefficient resulting from either
increasing the length of the test n times or repeating it n times.

r_11 is the coefficient of reliability for the original test.
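
The Spearman-Brown formula is easily evaluated. The sketch below is in Python; the original reliability of 0.60 is a hypothetical value and the function name is an assumption.

```python
def spearman_brown(original_reliability, n):
    """Reliability of a test lengthened (or repeated) n times."""
    r = original_reliability
    return n * r / (1 + (n - 1) * r)

# Hypothetical test with an original reliability coefficient of 0.60.
for n in (1, 2, 3, 4):
    print(n, round(spearman_brown(0.60, n), 3))
```

Doubling such a test raises the coefficient from 0.60 to 0.75; each further lengthening adds less, since the coefficient approaches but never exceeds 1.00.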


The Intelligence Quotient 


The scores on various intelligence tests given an individual 
will increase with his age. It is necessary, therefore, to relate 
the mental age as indicated by the intelligence tests to his chron- 
ological age. 


$$ \text{I.Q.} = \frac{MA}{CA} $$

where

I.Q. = Intelligence Quotient
MA = Mental age
CA = Chronological age.

For statistical purposes the normal intelligence quotient for
any age will then be 100.

In order to calculate the value of the intelligence quotient the
chronological age is increased until it reaches a maximum at which
it is retained. Otis¹ argues that the maximum age should be 18
rather than the generally used maximum of 16 years.


Subject Quotients and Ratios


Pupil accomplishment in a particular subject varies according
to chronological age. To obtain a true picture of relative ability
in a given subject it is necessary to compare "subject age" to
chronological age.


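$$ \text{Arithmetic Quotient} = \frac{\text{Arithmetic Age}}{\text{Chronological Age}} $$

$$ \text{Reading Quotient} = \frac{\text{Reading Age}}{\text{Chronological Age}} $$

$$ \text{Any Subject Quotient} = \frac{\text{Any Subject Age}}{\text{Chronological Age}} $$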


The average of a pupil's subject quotients is known as his
educational quotient.

If the mental age is used in the quotients outlined above the
result is known as a subject ratio, etc.

$$ \text{Subject Ratio} = \frac{\text{Subject Age}}{\text{Mental Age}} $$

The accomplishment ratio may then be obtained by averaging
the subject ratios.
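
The quotients and ratios of this section can be computed as in the following Python sketch. The pupil's ages (stated in months) are hypothetical, the function names are assumptions, and the results are left as plain ratios without any further scaling.

```python
from statistics import mean

def quotient(subject_age, reference_age):
    """Subject age divided by the reference (chronological or mental) age."""
    return subject_age / reference_age

def educational_quotient(subject_ages, chronological_age):
    """Average of the subject quotients."""
    return mean(quotient(age, chronological_age) for age in subject_ages.values())

def accomplishment_ratio(subject_ages, mental_age):
    """Average of the subject ratios (subject age divided by mental age)."""
    return mean(quotient(age, mental_age) for age in subject_ages.values())

# Hypothetical pupil: all ages in months.
subject_ages = {"arithmetic": 132, "reading": 126}
chronological_age = 120
mental_age = 129

print(round(educational_quotient(subject_ages, chronological_age), 3))
print(round(accomplishment_ratio(subject_ages, mental_age), 3))
```

The same pupil may show an educational quotient above 1 and an accomplishment ratio at 1, which indicates attainment beyond his years yet exactly in line with his mental age.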


Special Techniques in Biology 

Index of Abmodality 

A deviation from average or type is of little significance unless
the deviation is related to the customary dispersion of the data.
A deviation of two inches from the average height of a man of a
certain age is of little import unless compared to the ordinary
or usual dispersion of the heights of men of the same age. The same
principle would hold true in the deviation of an inch from the
average length of a squirrel of a specified age, etc.

In order to take the dispersion of the data into consideration
the deviation from the mean may be related to the standard
deviation of the data.²

$$ \text{Index of abmodality} = \frac{x}{\sigma} $$

This measure is known to biologists as the index of abmodality.

¹ Otis, Arthur S., Method in Educational Measurement, 1925, p 1 '’►0.

² In the field of education this is known as the standard score (see page 111).
The index of abmodality for an essentially normal distribution
indicates the number of standard deviations the given value is
from the mean. Thus if the index attains a certain value it
may be further interpreted in light of the previous discussion on
the normal curve (see chapter XI).

It is known that a deviation larger than 3 standard deviations
from the mean (in any one direction, plus or minus) will occur
less than 2 times in 1000. This knowledge is obtained from the
demonstrable fact that the area within 3 standard deviations from
the mean will include 49.87% of the cases (on one side only of
the mean), and therefore only .13% of the cases will be larger
than the value of the index of abmodality if it equals 3. Any
given value of the index of abmodality may thus be interpreted
with the aid of the normal curve area table.
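
In place of a printed table, the normal curve areas can be obtained from the error function. The Python sketch below is an illustration only; the height figures are hypothetical and the function names are assumptions.

```python
import math

def index_of_abmodality(value, mean, standard_deviation):
    """Deviation from the mean expressed in standard deviations."""
    return (value - mean) / standard_deviation

def area_mean_to_index(index):
    """Area under the normal curve between the mean and the given index."""
    return 0.5 * math.erf(abs(index) / math.sqrt(2))

def area_beyond_index(index):
    """Area under the normal curve lying beyond the index, on one side only."""
    return 0.5 * math.erfc(abs(index) / math.sqrt(2))

# Hypothetical case: a height of 70 inches where the mean is 68 and sigma 2.5.
print(round(index_of_abmodality(70, 68, 2.5), 2))

# An index of 3: area between mean and index, and area beyond it, in percent.
print(round(100 * area_mean_to_index(3), 2), round(100 * area_beyond_index(3), 2))
```

For an index of 3 the two areas are about 49.87% and 0.13%, which agrees with the figures quoted from the normal curve area table.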

Coefficient of Heredity 

When the coefficient of correlation is applied to the measurement
of the association between a specific characteristic of a parent
and the same characteristic of an offspring it is known as the
coefficient of heredity.

The coefficient of heredity between fathers and offspring is
assigned the symbol r₁; that between mothers and offspring, r₂.

Coefficient of Assortative Mating 

When the coefficient of correlation is used to measure the
association between a specified characteristic of fathers and the
same characteristic of mothers it is known as the coefficient of
assortative mating. The symbol assigned to this coefficient is r₃.

Variability of Offspring 

The variability (standard deviation) of a group of offspring
from particular parents may be determined from the following
formula:

$$ \Sigma_3 = \sigma_3 \sqrt{1 - \frac{2 r_1^2}{1 + r_3}} $$

where

Σ₃ = standard deviation of an array of offspring.
σ₃ = standard deviation of offspring in general.
r₁ = coefficient of heredity between offspring and parents,
assuming parents to be equipotent (r₁ = r₂).
r₃ = coefficient of assortative mating.

Abmodality of Offspring 

The average abmodality of a group of offspring from parents
of fixed characteristics may be computed from the formula:

$$ h_3 = \frac{r_1}{1 + r_3}\, \sigma_3 \left( \frac{h_1}{\sigma_1} + \frac{h_2}{\sigma_2} \right) $$

where

h₃ = deviation of mean of given characteristic of offspring from
mean of characteristic of all offspring.
h₁ = deviation (abmodality) of father.
h₂ = deviation of mother.
σ₁ = standard deviation of characteristic in fathers in general.
σ₂ = standard deviation of characteristic of mothers in general.
σ₃ = standard deviation of offspring in general.
r₁ = coefficient of heredity (assuming r₁ = r₂).
r₃ = coefficient of assortative mating.


Vital Statistics

Data relative to deaths, births, and sickness are significant only
if considered in relation to the size and kind of population from
which they were drawn. Thus the fact that 2,000 deaths were
recorded in a year in a particular city is of no significance unless
the population of the city is known. If the 2,000 deaths were
recorded in New York City with a population of approximately
7,400,000 an entirely different significance would be attached to
such a record than if it were recorded in a city of 25,000 population.

A rate is an expression of the number of times a specific kind of 
event occurs in a given population in relation to the total number 
in the population exposed to the possibility of its occurrence. This 
may be expressed in the form of a formula as: 


$$ \text{Rate} = \frac{a}{a + b} $$

where

a = number of times event appears in the population
b = number of times event does not appear in the population


The resulting value is in decimal form but is generally
multiplied by 100, 1,000, 100,000 or 1,000,000 to give the result as
percent (per 100), per 1000, per 100,000 or per million.
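
The rate formula may be applied to the two cities mentioned earlier in this section. The Python sketch below is illustrative; the function name and the choice of a rate per 1,000 are assumptions.

```python
def rate(occurrences, non_occurrences, per=1000):
    """Rate = a / (a + b), expressed per `per` persons exposed."""
    exposed = occurrences + non_occurrences
    return occurrences / exposed * per

# 2,000 deaths in a population of about 7,400,000 and in one of 25,000.
new_york = rate(2_000, 7_400_000 - 2_000)
small_city = rate(2_000, 25_000 - 2_000)

print(round(new_york, 2), "per 1,000")
print(round(small_city, 1), "per 1,000")
```

The same 2,000 deaths give a rate of roughly a quarter of a death per 1,000 in the large city and 80 per 1,000 in the small one, which is the difference in significance the text describes.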

A ratio expresses the relation of occurrence of a given kind of 
event to the occurrence of other events or of one kind of data to 
another. In formula form this is: 


$$ \text{Ratio} = \frac{a}{c} $$

where

a = number of times event occurs
c = number of times another event occurs


Vital statistics make use of birth, death and morbidity rates.
These rates are important also in medical and actuarial statistics.
Birth, death and morbidity rates may be classified as follows:¹

¹ This classification is after Pearl, Raymond, in Medical Biometry and Statistics.

A — Mortality Rates (Death Rates)

1. Observed

a. Crude death rates 

b. Specific death rates 

2. Theoretic Death Rates 

a. Standardized death rates 

b. Corrected death rates 

B — Natality Rates (Birth Rates) 

1. Observed 

a. Crude Birth Rates 

b. Specific Birth Rates 

2. Theoretic Birth Rates 

a. Standardized Birth Rates 

b. Corrected Birth Rates 

C — Morbidity Rates 
1. Observed 

a. Crude 

b. Specific 

Crude Death, Birth and Morbidity Rates 

The crude death, birth or morbidity rates are merely the total
number of deaths, births, or cases of sickness divided by the total
population.

$$ \text{Crude Death Rate} = \frac{D}{P} $$

$$ \text{Crude Birth Rate} = \frac{B}{P} $$

$$ \text{Crude Morbidity Rate} = \frac{M}{P} $$

where

D = number of deaths
B = number of births
M = number of persons sick
P = total population
Specific Death Rate 

Although these rates must be specified as to time and place 
they are crude in that they do not include specifications as to age 
or sex. When such specifications are made the rate is known as 
the specific death, birth, or morbidity rate. 

$$ \text{Specific Death, Birth, or Morbidity Rate} = \frac{D' \text{ or } B' \text{ or } M'}{P'} $$

where

D' = deaths in a specified class of population
B' = births in a specified class of population
M' = number of persons sick in a specified class of population
P' = total number of persons in the specified population group.

The specific death rates at various ages will of course vary 
greatly. 

If it is assumed that 100,000 persons are born at the same
instant a hypothetical table showing the number of survivors, the
number dying, the rate of mortality and the stationary population
at each age interval can be constructed. A table of this kind is
known as a life table.¹

From the life table a stationary life table may be prepared
showing the number of persons per million of each yearly interval
of age. Table 42 is such a table.

Standardized Death Rates 

Due to various factors such as immigration, type of community, 
etc. the actual age distribution in one location may differ greatly 
from that in another community making it impossible to directly 
compare the crude death rates for all ages for the two localities. 
An allowance must be made for the difference. This is accom- 
plished by means of standardized and corrected death rates. 

The standardized death rate is obtained by applying the specific
death rate obtained from the general population or a life table to
the actual age distribution of the given population. The rate
obtained is the rate that would exist if the hypothetical specific
rate existed with the actual distribution of age.

$$ \text{Standardized Death Rate} = \frac{\Sigma (pq)}{\Sigma (p)} $$

where

p = actual population for each age
q = specific death rate from life table

A comparison of this rate to the death rate of the standard
population (from the life table) gives rise to a correction factor.

$$ \text{Correction Factor} = \frac{R}{R'} $$

where

R = death rate from life table
R' = standardized death rate

Multiplying the crude death rate by this correction factor will
make an allowance for the different age distributions.

¹ See Glover, J. W., United States Life Tables 1890, 1901, 1910 and 1901-1910, Bureau of
Census, Washington, 1921.
Table 42 — Stationary Life Table Population of 1,000,000 Persons.
Number Living in Each Yearly Interval of Age.

(Each pair of columns gives the age interval and the persons per
million in the current age interval.)

Age        Persons per    Age        Persons per    Age        Persons per
Interval   Million        Interval   Million        Interval   Million

  0-1        17,841        35-36       14,146        70-71        6,373
  1-2        16,916        36-37       14,031        71-72        5,979
  2-3        16,612        37-38       13,912        72-73        5,597
  3-4        16,448        38-39       13,791        73-74        5,178
  4-5        16,338        39-40       13,667        74-75        4,776
  5-6        16,255        40-41       13,540        75-76        4,375
  6-7        16,186        41-42       13,411        76-77        3,978
  7-8        16,127        42-43       13,278        77-78        3,589
  8-9        16,078        43-44       13,141        78-79        3,210
  9-10       16,036        44-45       13,000        79-80        2,843
 10-11       15,998        45-46       12,854        80-81        2,490
 11-12       15,962        46-47       12,702        81-82        2,152
 12-13       15,927        47-48       12,645        82-83        1,835
 13-14       15,890        48-49       12,383        83-84        1,546
 14-15       15,851        49-50       12,216        84-85        1,287
 15-16       15,808        50-51       12,045        85-86        1,058
 16-17       15,761        51-52       11,867        86-87          859
 17-18       15,708        52-53       11,683        87-88          687
 18-19       15,650        53-54       11,489        88-89          641
 19-20       15,586        54-55       11,284        89-90          418
 20-21       15,516        55-56       11,067        90-91          318
 21-22       15,441        56-57       10,836        91-92          236
 22-23       15,363        57-58       10,592        92-93          172
 23-24       15,282        58-59       10,336        93-94          123
 24-25       15,200        59-60       10,069        94-95           86
 25-26       15,117        60-61        9,791        95-96           69
 26-27       15,032        61-62        9,501        96-97           39
 27-28       14,946        62-63        9,199        97-98           26
 28-29       14,857        63-64        8,884        98-99           17
 29-30       14,765        64-65        8,556        99-100          10
 30-31       14,671        65-66        8,217       100-101           6
 31-32       14,573        66-67        7,868       101-102           4
 32-33       14,472        67-68        7,608       102-103           2
 33-34       14,367        68-69        7,139       103-104           1
 34-35       14,259        69-70        6,760       104-105           1


Source: Pearl, Raymond, Medical Biometry and Statistics, page 259.


Corrected Death Rates 


The corrected death rate is obtained by using the specific death
rates of the locality and the hypothetical age distribution of the life
table. This computation places the rates on a strictly comparable
basis insofar as the age distribution is concerned.

$$ \text{Corrected Death Rate} = \frac{\Sigma (p'q')}{\Sigma (p')} $$

where

p' = population in age group from life table
q' = actual specific death rates
Corrections have thus been made for varying age distributions.
In a similar fashion corrections may be made for other differences
such as sex, race, etc. Rates other than death rates may be
corrected in the same manner.
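
The two methods of allowing for age distribution can be compared in a short Python sketch. It reads the standardized rate as Σ(pq)/Σ(p), the correction factor as R/R', and the corrected rate as Σ(p'q')/Σ(p'). All populations and rates below are hypothetical figures for three broad age groups, and the function names are assumptions.

```python
def weighted_rate(populations, specific_rates):
    """Sum of population times specific rate, divided by total population."""
    deaths = sum(p * q for p, q in zip(populations, specific_rates))
    return deaths / sum(populations)

def correction_factor(standard_death_rate, standardized_death_rate):
    """Death rate of the standard population divided by the standardized rate."""
    return standard_death_rate / standardized_death_rate

# Hypothetical data for three age groups (young, middle, old).
local_population = [1_000, 2_000, 1_000]       # p: actual population
standard_population = [3_000, 5_000, 2_000]    # p': life table population
standard_rates = [0.010, 0.005, 0.030]         # q: life table specific rates
local_rates = [0.012, 0.006, 0.040]            # q': actual specific rates

crude = weighted_rate(local_population, local_rates)
standardized = weighted_rate(local_population, standard_rates)
standard_rate = weighted_rate(standard_population, standard_rates)
factor = correction_factor(standard_rate, standardized)
corrected = weighted_rate(standard_population, local_rates)

print(round(crude * 1000, 2), round(crude * factor * 1000, 2), round(corrected * 1000, 2))
```

With these figures the locality has a larger share of old persons than the standard population, so both the crude rate multiplied by the correction factor and the corrected rate come out below the crude rate.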

Production 

Quality Control 

The quality of a given product may be defined as its conformity 
to given standards or specifications. A manufactured product 
exhibits a certain amount of variation in its conformity to spe- 
cifications, no matter how carefully guarded the process may be, 
due to innumerable chance causes. As long as only these chance, 
uncontrollable factors are the cause of variation the quality is said 
to be controlled. However, as soon as some controllable cause 
enters to cause variation, control is no longer had and the product 
departs from conformity to standard. 

The problem of quality control is thus essentially the
determination of the entrance of factors other than those which from an
economic viewpoint should be left to chance. Dr. Shewhart¹
suggests the following criteria to determine whether the variations
exhibited by any measured characteristics of a manufactured
product are due to chance or some controllable factor or, in other
words, the existence of or lack of "control."²

Criterion I

1. Divide the data into m rational groups (such as week,
day, plant, etc.) of n items each.

2. Determine a statistic such as $\bar{X}$ or $\sigma$ for the entire group.

3. Represent this value as a horizontal line on a graph.

4. Compute the standard error of the statistic,

$$\sigma_p = \sqrt{\frac{pq}{N}}, \qquad \sigma_\sigma = \frac{\sigma}{\sqrt{2N}}, \quad \text{etc.}$$

When n is small the theory of small samples (see pages 129-130)
must be used.

5. Measure off a zone of 3 standard errors on either side of the
horizontal line. The resulting diagram is a control chart.

6. The statistics for each group ($\bar{X}$, p, $\sigma$, etc.) may now be
located on the control chart. If any of the points fall outside
of the control zone there is an indication of the
presence of assignable causes of variability which should be
investigated.

¹ W. A. Shewhart, Economic Control of Quality of Manufactured Product, D. Van Nostrand,
New York, 1931.

² The discussion of this topic is of necessity very brief. For a detailed and complete
discussion, which is essential to a thorough understanding, the reader is referred to
Dr. Shewhart's book.
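
The steps of Criterion I may be sketched in Python for the fraction defective p. The daily counts are invented for illustration, the function names are hypothetical, and the standard error is taken for a group of n items; it is a sketch of the procedure and not a substitute for Dr. Shewhart's treatment.

```python
import math

def p_chart_limits(defectives, n):
    """Return (center, lower, upper) for a control chart of the fraction
    defective, with a zone of 3 standard errors about the center line."""
    m = len(defectives)
    p = sum(defectives) / (m * n)        # statistic for the entire group
    se = math.sqrt(p * (1 - p) / n)      # standard error for a group of n items
    return p, max(0.0, p - 3 * se), p + 3 * se

def out_of_control(defectives, n):
    """Indices of the rational groups whose fraction falls outside the zone."""
    _, lower, upper = p_chart_limits(defectives, n)
    return [i for i, d in enumerate(defectives) if not lower <= d / n <= upper]

# Hypothetical counts of defective items in 8 daily groups of 100 items each.
daily_defectives = [4, 6, 5, 3, 7, 5, 16, 4]
center, lower, upper = p_chart_limits(daily_defectives, 100)
print(f"center line {center:.4f}, control zone {lower:.4f} to {upper:.4f}")
print("groups to investigate:", out_of_control(daily_defectives, 100))
```

A group listed by `out_of_control` corresponds to a point falling outside the control zone in step 6.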
From Shewhart, W. A., Economic Control of Quality of Manufactured Product, p. 313.

Fig. 37. Control Chart for Small Samples Showing Lack of Control


Criterion II

1. Divide the data into m rational groups of n items each.

2. Compute d

$$d = \frac{n}{n-1}\,\bar{\sigma}^2 - \frac{m}{m-1}\,n\,\sigma_{\bar{X}}^2$$

where

$\bar{\sigma}^2$ = average $\sigma^2$ for all groups
$\sigma_{\bar{X}}^2$ = variance (square of standard deviation)
of the means of the groups.

For a Bernoulli distribution (a constant system of causes)
d will equal zero. However, due to sampling fluctuations
a value as great as $3\sigma_d$ may arise, the probability being
99.7 chances out of 100 that no greater value will occur, where

$$\sigma_d = \sqrt{\frac{2(mn-1)}{m(m-1)(n-1)}}\left[\frac{n}{n-1}\,\bar{\sigma}^2\right]$$

If a value of d greater than $3\sigma_d$ is secured it is an indication
that the samples are not drawn in a Bernoulli fashion or
that the system of causes is not controlled; that is, control
is lacking.
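
Criterion II may be sketched in Python as follows. The sketch assumes one reading of the printed formula, namely that d is the within-group estimate of variance less the estimate drawn from the variance of the group means, and it compares the magnitude of d with three times its standard error. The data and the function name are hypothetical.

```python
import math
from statistics import fmean, pvariance

def criterion_two(groups):
    """Return (d, sigma_d) for m rational groups of n items each."""
    m, n = len(groups), len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("every group must contain n items")
    mean_var = fmean(pvariance(g) for g in groups)         # average variance within groups
    var_of_means = pvariance([fmean(g) for g in groups])   # variance of the group means
    within = n / (n - 1) * mean_var
    between = m / (m - 1) * n * var_of_means
    d = within - between
    sigma_d = math.sqrt(2 * (m * n - 1) / (m * (m - 1) * (n - 1))) * within
    return d, sigma_d

# Hypothetical data: the level of the first set drifts from group to group.
groups_shifted = [[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]]
groups_steady = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3]]

for name, groups in (("shifted", groups_shifted), ("steady", groups_steady)):
    d, sigma_d = criterion_two(groups)
    verdict = "lack of control indicated" if abs(d) > 3 * sigma_d else "no indication"
    print(f"{name}: d = {d:.3f}, 3 sigma_d = {3 * sigma_d:.3f}, {verdict}")
```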


Criterion III

1. Determine a suspected factor which might be the only
assignable cause of the variability and two sets of variables
in which variation may be caused by the suspected factor. One of
the variables should be controlled.

2. Secure the coefficient of correlation (r) between these two
variables.

3. If r is greater than $3\sigma_r$, indicating a significant correlation,
the suspected factor may be said to be the assignable cause.

Criterion IV

1. Obtain n observations and calculate some statistic (such
as $\bar{X}$ or $\sigma$).

2. Choose some factor which may or may not be an assignable
cause of variation and select n additional observations
under conditions where it is known that they cannot be
affected by this factor.

3. By means of the standard error of the difference between
the two computed statistics determine whether the difference
is significant. If the difference between the statistics
is greater than three times the standard error of the
difference, the suspected cause is the assignable cause, as
when:

$$\bar{X}_1 - \bar{X}_2 > 3\sqrt{\sigma_{\bar{X}_1}^2 + \sigma_{\bar{X}_2}^2}$$

$$\sigma_1 - \sigma_2 > 3\sqrt{\sigma_{\sigma_1}^2 + \sigma_{\sigma_2}^2}$$
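
The comparison in step 3 may be sketched in Python for the case of two arithmetic means. The sketch assumes that the standard error of each mean is its standard deviation divided by the square root of the number of observations, and that the standard error of the difference is the square root of the sum of the squared standard errors. The measurements and names are hypothetical.

```python
import math
from statistics import fmean, pstdev

def difference_is_significant(sample_1, sample_2):
    """Compare two means with 3 times the standard error of their difference."""
    se_1 = pstdev(sample_1) / math.sqrt(len(sample_1))   # standard error of first mean
    se_2 = pstdev(sample_2) / math.sqrt(len(sample_2))   # standard error of second mean
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2)
    diff = abs(fmean(sample_1) - fmean(sample_2))
    return diff > 3 * se_diff, diff, se_diff

# Hypothetical measurements with and without the suspected factor.
with_factor = [10.2, 10.4, 10.1, 10.5, 10.3, 10.3]
without_factor = [9.6, 9.8, 9.7, 9.9, 9.7, 9.7]

significant, diff, se_diff = difference_is_significant(with_factor, without_factor)
print(f"difference {diff:.3f}, 3 x standard error {3 * se_diff:.3f}")
print("suspected factor is the assignable cause" if significant
      else "difference may be due to chance")
```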



Criterion V

1. Fit an appropriate frequency curve to the grouped data
by calculating $\bar{X}$, $\sigma$ and the skewness k and using

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{x^2}{2\sigma^2}} \left[1 - \frac{k}{2}\left(\frac{x}{\sigma} - \frac{1}{3}\frac{x^3}{\sigma^3}\right)\right]$$

2. Test the theoretical distribution for goodness of fit through
the $\chi^2$ test. If the fit is poor (P is less than .001) this is an
indication of lack of control.
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A. Long Method 
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x.-z + ^c 

Median 
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Mode 
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Mode 
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13 
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B. Empirical Method 
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On 




(x») 


N 


1 1 1 

1 X, X, X, 


+ J_ 
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Measures of Dispersion 

Mean Deviation 

Ungrouped Data 


27 


27 
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AT X 
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/ . ^ W , (X.+ X0c+/„(.25 + c») 


N 


N 
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33 

Rietz 



LIST OF FORMULAS

Standard Deviation

From Ungrouped Data

$$\sigma = \sqrt{\frac{\sum (X^2)}{N} - \left(\frac{\sum X}{N}\right)^2}$$

From Grouped Data

Long Method

Short (unit deviation) Method

LIST OF SYMBOLS
a – Y intercept
a – number of ways in which a favorable outcome can appear
b – coefficient of slope (or regression)
b – possible number of unfavorable results
$b_{12.34}$ – coefficient of net regression of $X_1$ on $X_2$ excluding $X_3$ and $X_4$
C – size of class interval
CA – chronological age
CC – coefficient of contingency
c – difference between arbitrary origin and mean or median
c – constant in trend or regression equation
D – differences in rank
d – deviation of an individual value or midpoint of a class interval from an arbitrary or guessed mean (or average other than arithmetic mean)
d – deviation from line of trend ($Y - Y_c$)
d – constant in trend or regression equation
d' – d in class interval units
e – a constant = 2.71828
f – frequency
$f_a$ – frequency of class interval above modal group
$f_b$ – frequency of class interval below modal group
G – positive differences in rank
$G_m$ – Geometric Mean
$H_m$ – Harmonic Mean
$h_1$ – deviation (abmodality) of father
$h_2$ – deviation (abmodality) of mother
$h_3$ – deviation of mean of given characteristic of offspring from mean of characteristic of all offspring
I – item
I.Q. – intelligence quotient
k – coefficient of non-determination
$L_{me}$ – lower limit of class interval containing median
$L_{mo}$ – lower limit of modal group
M.A. – mental age
MD – mean deviation
MD' – mean deviation in class intervals
M.P. – midpoint
N or n – number of cases
$N_L$ – number of cases overstated or too large
$N_S$ – number of cases understated or too small
$P.E._{M}$ – probable error of mean
$P.E._{\theta}$ – probable error of any statistical measure ($\theta$) such as $P.E._{\sigma}$, $P.E._{r}$, etc.
p – product moment
p – probability of success
$p_n$ – price of a commodity in period n
$p_0$ – price of a commodity in base period
$p_0'$ – price of first commodity in base period
$p_0''$ – price of second commodity in base period
$p_1$ – price of a commodity in first period
$p_2$ – price of a commodity in second period
$Q_1$ – first quartile
$Q_3$ – third quartile
QD – quartile deviation (semi-interquartile range)
$Q_m$ – quadratic mean
q – probability of failure
$q_n$ – quantity of a commodity produced or consumed in period n
$q_0$ – quantity of a commodity produced or consumed in base period
$q_0'$ – quantity of first commodity produced or consumed in base period
$q_0''$ – quantity of second commodity produced or consumed in base period
$q_1$ – quantity of a commodity produced or consumed in first period
$q_2$ – quantity of a commodity produced or consumed in second period
R – coefficient of correlation computed by Spearman's "footrule" method
$R_{1.23}$ – coefficient of multiple correlation between $X_1$ and $X_2$, $X_3$
r or $r_{12}$ – coefficient of correlation
$\bar{r}$ – coefficient of correlation corrected for number of cases
$r_1$ – coefficient of heredity (fathers and offspring)
$r_2$ – coefficient of heredity (mothers and offspring)
$r_3$ – coefficient of assortative mating (fathers and mothers)
$r_{11}$ – coefficient of reliability
$r_{12.3}$ – coefficient of partial correlation between $X_1$ and $X_2$ excluding $X_3$
${}_{12}r_{34}$ – coefficient of part correlation between $X_1$ and $X_2$ excluding $X_3$ and $X_4$
$S_{1.234}$ – standard error of estimate measured about regression surface $X_1 = f(X_2) + f(X_3) + f(X_4)$
$S_y$ – standard error of estimate
$\bar{S}_y$ – standard error of estimate corrected for number of cases
W or $W_t$ – weight

X – an individual value
$\bar{X}$ – arithmetic mean
x – deviation of individual value from its arithmetic mean
Y – an individual value
$\bar{Y}$ – arithmetic mean of Y values
$Y_c$ – computed Y value as determined from line of trend or regression
y – deviation of individual Y value from its arithmetic mean
Z – an arbitrarily selected value (guessed mean)
z – standard score
z' – residual: difference between actual value and theoretical line of regression value
$z_1$ – standard score on first test
$z_2$ – standard score on second test
$\alpha_3$ – measure of skewness
$\beta_1$ – curve criterion
$\beta_2$ – curve criterion (measure of kurtosis)
$\beta_{12.3}$, $\beta_{12.34}$, $\beta_{13.24}$, $\beta_{14.23}$ – beta coefficients
$\zeta$ – test for linearity of regression
$\eta$ – correlation ratio
$\theta$ – any statistic
$\kappa$ – curve criterion
$\kappa$ – number of arrays in correlation table
$\mu_1'$, $\mu_2'$, $\mu_3'$, $\mu_4'$ – moments about arithmetic mean corrected for grouping (Sheppard's Corrections)
– moments about arithmetic mean
– moments about arbitrary origin
$\pi$ – a constant = 3.141593
$\rho$ – index of correlation (measured on basis of curvilinear line of regression)
$\rho$ – coefficient of correlation from ranks
$\Sigma$ – sum of
$\sigma$ – standard deviation
– standard deviation corrected for grouping error
– standard deviation of values about means of respective columns in correlation table
$\sigma_{\bar{X}}$ – standard error of arithmetic mean; the standard error of any statistical measure ($\sigma_\sigma$, $\sigma_r$, etc.) is similarly written
– standard deviation of an array of offspring
– standard deviation of offspring in general
$\phi^2$ – mean squared contingency
– measure of skewness
$\chi^2$ – chi square, value used in test for goodness of fit

TECHNICAL APPENDIX I

DERIVATION OF SHORT METHOD OF
COMPUTING ARITHMETIC MEAN¹

For Ungrouped Data

Let each value (X) equal

$$X = Z + d$$

or the arbitrary starting point plus the deviation of the value
from that point.

The total of all values

$$\sum X = \sum Z + \sum d$$

but since Z is a constant, this may be rewritten

$$\sum X = NZ + \sum d$$

Dividing the total by N to obtain the arithmetic mean the
result is

$$\frac{\sum X}{N} = \frac{NZ}{N} + \frac{\sum d}{N}$$

or

$$\bar{X} = Z + \frac{\sum d}{N}$$

For Grouped Data

The midpoint of each group may be measured as a deviation
from the guessed mean.

$$M.P. = Z + d$$

To obtain the total value of all cases in the class interval the
midpoint of each group is multiplied by the number of cases in
the group

$$f \times MP = fZ + fd$$

Totaling up for all class intervals

$$\sum (f \times MP) = \sum (fZ) + \sum (fd)$$

or since Z is a constant

$$\sum (f \times MP) = Z\sum (f) + \sum (fd)$$

and since $\sum (f) = N$

$$\sum (f \times MP) = NZ + \sum (fd)$$

dividing by N to obtain the arithmetic mean

$$\frac{\sum (f \times MP)}{N} = Z + \frac{\sum (fd)}{N}$$

¹ This derivation is after Yule.
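
The result may be checked numerically. The following Python sketch uses an invented frequency table and hypothetical function names; it shows that the short method gives the same mean whatever guessed mean Z is chosen.

```python
def mean_long_method(midpoints, frequencies):
    """Sum of (f x MP) divided by N."""
    return sum(f * mp for mp, f in zip(midpoints, frequencies)) / sum(frequencies)

def mean_short_method(midpoints, frequencies, guessed_mean):
    """Z plus the sum of (f x d) divided by N, with d measured from Z."""
    n_cases = sum(frequencies)
    sum_fd = sum(f * (mp - guessed_mean) for mp, f in zip(midpoints, frequencies))
    return guessed_mean + sum_fd / n_cases

# Hypothetical grouped data: class midpoints and their frequencies.
midpoints = [5, 15, 25, 35, 45]
frequencies = [2, 5, 8, 4, 1]

print(mean_long_method(midpoints, frequencies))
for z in (0, 15, 25, 35):
    print(z, mean_short_method(midpoints, frequencies, z))
```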


TECHNICAL APPENDIX II

DERIVATION OF SHORT FORMULA FOR
STANDARD DEVIATION

If d is the deviation of a given point from an arbitrary origin
(Z)

$$d = X - Z$$

and if the difference between the mean and this origin is termed c

$$c = \bar{X} - Z$$

then

$$d - c = (X - Z) - (\bar{X} - Z)$$

$$= X - Z - \bar{X} + Z$$

$$= X - \bar{X} \text{ or } x$$

where x is the deviation of a value from the arithmetic mean, but

$$\sigma = \sqrt{\frac{\sum f(x^2)}{N}}$$

Since

$$d - c = x$$

$$d = x + c$$

$$d^2 = x^2 + 2cx + c^2$$

and

$$f(d^2) = f(x^2) + 2cfx + fc^2$$

and

$$\sum f(d^2) = \sum f(x^2) + 2c\sum (fx) + c^2\sum f$$

$$= \sum f(x^2) + 2c\sum (fx) + Nc^2$$

for

$$\sum f = N$$

But the sum of the deviations about the arithmetic mean is zero

$$\sum (fx) = 0$$

and the formula reduces to

$$\sum f(d^2) = \sum f(x^2) + Nc^2$$

or

$$\sum f(x^2) = \sum f(d^2) - Nc^2$$

and

$$\frac{\sum f(x^2)}{N} = \frac{\sum f(d^2)}{N} - c^2$$

but

$$\sigma = \sqrt{\frac{\sum f(x^2)}{N}} \qquad \text{(see page 36)}$$

but as above

$$c = \frac{\sum (fd)}{N}$$

and

$$\sigma = \sqrt{\frac{\sum f(d^2)}{N} - \left(\frac{\sum (fd)}{N}\right)^2}$$
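
A numerical check of the short formula may be sketched in Python. The frequency table is invented and the function names are hypothetical; the point illustrated is that the deviations d from any arbitrary origin give the same standard deviation as the deviations x from the mean.

```python
import math

def sd_long_method(midpoints, frequencies):
    """Square root of the sum of f(x^2) over N, with x measured from the mean."""
    n = sum(frequencies)
    mean = sum(f * mp for mp, f in zip(midpoints, frequencies)) / n
    return math.sqrt(sum(f * (mp - mean) ** 2 for mp, f in zip(midpoints, frequencies)) / n)

def sd_short_method(midpoints, frequencies, origin):
    """Square root of sum f(d^2)/N less the square of sum(fd)/N."""
    n = sum(frequencies)
    sum_fd = sum(f * (mp - origin) for mp, f in zip(midpoints, frequencies))
    sum_fd2 = sum(f * (mp - origin) ** 2 for mp, f in zip(midpoints, frequencies))
    c = sum_fd / n                      # difference between mean and origin
    return math.sqrt(sum_fd2 / n - c ** 2)

# Hypothetical grouped data: class midpoints and their frequencies.
midpoints = [5, 15, 25, 35, 45]
frequencies = [2, 5, 8, 4, 1]

print(sd_long_method(midpoints, frequencies))
print(sd_short_method(midpoints, frequencies, origin=25))
```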


TECHNICAL APPENDIX III

DERIVATION OF SHORT FORMULA FOR
STANDARD DEVIATION, UNGROUPED DATA

A simpler formula may be arrived at for ungrouped data by
selecting zero as the arbitrary origin; then

$$d = X - Z$$

but since

$$Z = 0$$

$$d = X$$

and

$$c = \bar{X} - Z$$

since Z = 0

$$c = \bar{X}$$

and

$$d^2 = x^2 + 2cx + c^2 \qquad \text{(see page 212)}$$

since

$$d = X$$

$$X^2 = x^2 + 2cx + c^2$$

and

$$\sum (X^2) = \sum (x^2) + 2c\sum (x) + Nc^2$$

but

$$\sum (x) = 0 \text{ and } c = \bar{X}$$

$$\therefore \sum (X^2) = \sum (x^2) + N\bar{X}^2$$

and

$$\sum (x^2) = \sum (X^2) - N\bar{X}^2$$

Since

$$\sigma^2 = \frac{\sum (x^2)}{N} = \frac{\sum (X^2)}{N} - \frac{N\bar{X}^2}{N}$$

$$\sigma = \sqrt{\frac{\sum (X^2)}{N} - \bar{X}^2}$$
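
The formula for ungrouped data may be tried on a small set of values. In this Python sketch the eight values are invented and the function name is hypothetical; the standard library function `statistics.pstdev` is used only as an independent check.

```python
import math
from statistics import pstdev

def sd_zero_origin(values):
    """Square root of sum(X^2)/N less the square of the mean,
    using the original values with zero as the arbitrary origin."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum(v * v for v in values) / n - mean ** 2)

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical ungrouped data
print(sd_zero_origin(values))        # from the short formula
print(pstdev(values))                # from deviations about the mean
```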


TECHNICAL APPENDIX IV

DERIVATION OF "NORMAL" EQUATIONS FOR
LEAST SQUARES STRAIGHT LINE

As shown on pages 52-54 of the text the formula for any straight
line will be

$$Y_c = a + bX$$

where $Y_c$ represents the computed or theoretical value for Y
obtained by substituting the appropriate value of X in the formula.

The problem is to determine a line which will fulfill the conditions
of the principle of least squares; i.e., the sum of the
squares of the deviations of the actual from the theoretical values
will be a minimum.

The letter d may be used to represent the difference between
the actual and theoretical values. The purpose is then to obtain
a line so that:

$$\sum (d^2) = \text{a minimum}$$

but

$$d = Y - Y_c$$

$\therefore \sum (Y - Y_c)^2$ must equal a minimum. We may then obtain
the partial derivatives with respect to a and b and equate to
zero to obtain a minimum.

$$\frac{\partial}{\partial a}\sum (Y - Y_c)^2 = 2Na - 2\sum (Y) + 2b\sum (X)$$

$$\frac{\partial}{\partial b}\sum (Y - Y_c)^2 = 2b\sum (X^2) - 2\sum (YX) + 2a\sum (X)$$

Equating to zero:

$$\text{I} \qquad 2Na - 2\sum (Y) + 2b\sum (X) = 0$$

$$\text{II} \qquad 2b\sum (X^2) - 2\sum (XY) + 2a\sum (X) = 0$$

or

$$\text{I} \qquad \sum (Y) = Na + b\sum (X)$$

$$\text{II} \qquad \sum (XY) = a\sum (X) + b\sum (X^2)$$
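
The two normal equations may be solved directly for a and b. The Python sketch below uses five invented pairs of values and a hypothetical function name; it also confirms the property used in Appendix V, that the deviations d from the fitted line satisfy $\sum (d) = 0$ and $\sum (dX) = 0$.

```python
def fit_line(xs, ys):
    """Solve the normal equations
       I:  sum(Y)  = N a + b sum(X)
       II: sum(XY) = a sum(X) + b sum(X^2)
    for the intercept a and the slope b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    det = n * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("all X values are equal; no unique line exists")
    b = (n * sum_xy - sum_x * sum_y) / det
    a = (sum_y - b * sum_x) / n
    return a, b

xs = [1, 2, 3, 4, 5]      # hypothetical X values
ys = [2, 4, 5, 4, 5]      # hypothetical Y values
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"Yc = {a:.3f} + {b:.3f} X")
print(f"sum(d) = {sum(residuals):.2e}")
print(f"sum(dX) = {sum(d * x for d, x in zip(residuals, xs)):.2e}")
```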


TECHNICAL APPENDIX V

DERIVATION OF PRODUCT MOMENT FORMULA
FOR COEFFICIENT OF CORRELATION

The original formula for r is

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2} \qquad (1)$$

where

$$S_y^2 = \frac{\sum (d^2)}{N} \qquad (2)$$

$$\sigma_y^2 = \frac{\sum (y^2)}{N} \qquad (3)$$

assuming a straight line regression

$$Y_c = a + bX$$

$Y_c$ is used for the theoretical value obtained from the equation,
but

$$d = Y - Y_c \qquad (4)$$

$$\therefore d = Y - (a + bX) \qquad (5)$$

$$d = Y - a - bX \qquad (6)$$

multiplying by d

$$d^2 = dY - ad - bdX \qquad (7)$$

since there is one d for each value a summation is made for all
points

$$\sum (d^2) = \sum (dY) - a\sum (d) - b\sum (dX) \qquad (8)$$

Since the regression line is fitted by the least squares method

$$\sum (d) = 0$$

$$\sum (dX) = 0$$

$$\therefore \sum (d^2) = \sum (dY) \qquad (9)$$

multiplying

$$d = Y - a - bX$$

by Y and summing up

$$\sum (dY) = \sum (Y^2) - a\sum (Y) - b\sum (XY) \qquad (10)$$

but since

$$\sum (d^2) = \sum (dY)$$

$$\therefore \sum (d^2) = \sum (Y^2) - a\sum (Y) - b\sum (XY) \qquad (11)$$

and

$$\frac{\sum (d^2)}{N} = \frac{\sum (Y^2) - a\sum (Y) - b\sum (XY)}{N} \qquad (12)$$

but

$$S_y^2 = \frac{\sum (d^2)}{N} = \frac{\sum (Y^2) - a\sum (Y) - b\sum (XY)}{N} \qquad (13)$$

and

$$\sigma_y^2 = \frac{\sum (y^2)}{N} \qquad (14)$$

while

$$\frac{\sum (y^2)}{N} = \frac{\sum (Y^2)}{N} - c_y^2 \qquad (15)$$

substituting

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

$$r^2 = 1 - \frac{\dfrac{\sum (Y^2) - a\sum (Y) - b\sum (XY)}{N}}{\dfrac{\sum (Y^2)}{N} - c_y^2} \qquad (16)$$

multiplying numerator and denominator of fraction by N

$$r^2 = 1 - \frac{\sum (Y^2) - a\sum (Y) - b\sum (XY)}{\sum (Y^2) - Nc_y^2}$$

This formula may be reduced to*

$$r^2 = \frac{a\sum (Y) + b\sum (XY) - Nc_y^2}{\sum (Y^2) - Nc_y^2} \qquad (17)$$

The two normal equations for the line of regression are

$$\text{(I)} \qquad \sum (Y) = Na + b\sum (X) \qquad (18)$$

$$\text{(II)} \qquad \sum (XY) = a\sum (X) + b\sum (X^2) \qquad (19)$$

* This formula is known as the "least squares" formula for the coefficient of correlation.

If the point of averages ($\bar{X}$ and $\bar{Y}$) is used as an origin all
values will be reduced to deviations from their respective means
(x and y)

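where

$$x = X - \bar{X}$$

$$y = Y - \bar{Y}$$

So that the equations will read

$$\text{(I)} \qquad \sum (y) = Na + b\sum (x) \qquad (21)$$

$$\text{(II)} \qquad \sum (xy) = a\sum (x) + b\sum (x^2) \qquad (22)$$

but since the sum of the deviations about the arithmetic mean
equals zero

$$\therefore \sum (x) = 0$$

$$\sum (y) = 0$$

and the normal equations will reduce to

$$\text{(I)} \qquad Na = 0, \qquad a = 0 \qquad (23)$$

$$\text{(II)} \qquad \sum (xy) = b\sum (x^2), \qquad \therefore b = \frac{\sum (xy)}{\sum (x^2)} \qquad (24)$$

reducing equation (17) into terms of deviations from the point of
averages¹

$$r^2 = \frac{a\sum (y) + b\sum (xy) - Nc_y^2}{\sum (y^2) - Nc_y^2} \qquad (25)$$

but

$$a = 0$$

$$a\sum (y) = 0$$

$$\text{and } c_y = 0$$

the equation thus reduces to

$$r^2 = \frac{b\sum (xy)}{\sum (y^2)}$$

but from (22)

$$b = \frac{\sum (xy)}{\sum (x^2)}$$

$$\therefore r^2 = \frac{\sum (xy)}{\sum (x^2)} \cdot \frac{\sum (xy)}{\sum (y^2)} \qquad (26)$$

dividing numerator and denominator by $N^2$

$$r^2 = \frac{\left(\dfrac{\sum (xy)}{N}\right)^2}{\dfrac{\sum (x^2)}{N} \cdot \dfrac{\sum (y^2)}{N}} \qquad (27)$$

but

$$\frac{\sum (x^2)}{N} = \sigma_x^2 \qquad \text{(see chapter IV)} \qquad (28)$$

$$\frac{\sum (y^2)}{N} = \sigma_y^2 \qquad \text{(see chapter IV)} \qquad (29)$$

$$\therefore r = \frac{p}{\sigma_x \sigma_y} \qquad (30)$$

where

$$p = \frac{\sum (xy)}{N} \qquad (31)$$

¹ Since "a" (the Y intercept) equals zero the line will pass through the origin or the point
of averages.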


Using the values X and Y as deviations from an arbitrary
origin (in the case of ungrouped data, zero, so that the original
values may be used) p may be computed from

$$p = \frac{\sum (x'y')}{N} - c_x c_y \qquad (32)$$

where x' and y' are deviations from arbitrarily selected points,
for

$$x' = x + c_x$$

where $c_x$ is the difference between the true mean and an arbitrary
origin ($\frac{\sum (fd)}{N}$ for grouped data and $\frac{\sum (X)}{N}$ for ungrouped data where
zero is selected as an origin)

$$y' = y + c_y$$

$$x'y' = xy + c_x y + c_y x + c_x c_y \qquad (33)$$

summing up for all points

$$\sum (x'y') = \sum (xy) + c_x\sum (y) + c_y\sum (x) + Nc_x c_y \qquad (34)$$

but since the sums of deviations about the means total up to zero

$$\sum (y) = 0$$

$$\sum (x) = 0$$

and equation 34 reduces to

$$\sum (x'y') = \sum (xy) + Nc_x c_y$$

dividing by N

$$\frac{\sum (x'y')}{N} = \frac{\sum (xy)}{N} + c_x c_y$$

$$\therefore \frac{\sum (xy)}{N} = \frac{\sum (x'y')}{N} - c_x c_y = p$$

If the arbitrary origin used is zero

$$x' = X$$

$$y' = Y$$

and

$$p = \frac{\sum (xy)}{N} = \frac{\sum (XY)}{N} - c_x c_y$$

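TECHNICAL APPENDIX VI

DERIVATION OF FORMULA FOR
LINE OF REGRESSION

Since the regression line is assumed to be straight its formula
will be of the type

$$Y = a + bX \qquad (1)$$

with the two "normal" equations¹

$$\text{(I)} \qquad \sum (Y) = Na + b\sum (X)$$

$$\text{(II)} \qquad \sum (XY) = a\sum (X) + b\sum (X^2)$$

If the origin of the line is assumed to be at the point of averages
the normal equations will read

$$\text{(I)} \qquad \sum (y) = Na + b\sum (x)$$

$$\text{(II)} \qquad \sum (xy) = a\sum (x) + b\sum (x^2)$$

but

$$\sum (y) = 0$$

$$\sum (x) = 0$$

$$\therefore \text{(I)} \qquad Na = 0 \text{ and } a = 0$$

$$\text{(II)} \qquad \sum (xy) = b\sum (x^2) \text{ and } b = \frac{\sum (xy)}{\sum (x^2)}$$

Equation (1) will reduce to

$$y = bx$$

where

$$b = \frac{\sum (xy)}{\sum (x^2)}$$

dividing numerator and denominator by N

$$b = \frac{\dfrac{\sum (xy)}{N}}{\dfrac{\sum (x^2)}{N}}$$

but

$$\frac{\sum (x^2)}{N} = \sigma_x^2$$

$$\therefore b = \frac{\sum (xy)}{N\sigma_x^2}$$

but

$$\frac{\sum (xy)}{N\sigma_x^2} = \frac{\sum (xy)}{N\sigma_x\sigma_y} \cdot \frac{\sigma_y}{\sigma_x}$$

and

$$r = \frac{\sum (xy)}{N\sigma_x\sigma_y} \qquad \text{(product moment formula)}$$

$$\therefore y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

¹ See chapter VI.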
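
The two forms of the slope may be compared numerically. This Python sketch uses invented values and hypothetical names, and computes b first from the sums of deviations and then from r and the two standard deviations.

```python
from statistics import fmean, pstdev

def slope_from_deviations(xs, ys):
    """b = sum(xy) / sum(x^2), x and y being deviations from the means."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    return sum_xy / sum_xx

def slope_from_correlation(xs, ys):
    """b = r sigma_y / sigma_x, with r from the product moment formula."""
    n = len(xs)
    mean_x, mean_y = fmean(xs), fmean(ys)
    sigma_x, sigma_y = pstdev(xs), pstdev(ys)
    p = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    r = p / (sigma_x * sigma_y)
    return r * sigma_y / sigma_x

xs = [1, 2, 3, 4, 5]      # hypothetical X values
ys = [2, 4, 5, 4, 5]      # hypothetical Y values
print(slope_from_deviations(xs, ys))
print(slope_from_correlation(xs, ys))
```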


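TECHNICAL APPENDIX VII

MULTIPLE CORRELATION REGRESSION

The "normal" equations for three independent variables,
linear correlation, with the type formula

$$X_1 = a + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4$$

are:

$$\text{(I)} \qquad \sum (X_1) = Na + b_{12.34}\sum (X_2) + b_{13.24}\sum (X_3) + b_{14.23}\sum (X_4)$$

$$\text{(II)} \qquad \sum (X_1X_2) = a\sum (X_2) + b_{12.34}\sum (X_2^2) + b_{13.24}\sum (X_2X_3) + b_{14.23}\sum (X_2X_4)$$

$$\text{(III)} \qquad \sum (X_1X_3) = a\sum (X_3) + b_{12.34}\sum (X_2X_3) + b_{13.24}\sum (X_3^2) + b_{14.23}\sum (X_3X_4)$$

$$\text{(IV)} \qquad \sum (X_1X_4) = a\sum (X_4) + b_{12.34}\sum (X_2X_4) + b_{13.24}\sum (X_3X_4) + b_{14.23}\sum (X_4^2)$$

These equations may be simplified by assuming the origin to
be at the point of averages and dividing both sides of the equations
by N

$$\text{(I)} \qquad \frac{\sum (x_1)}{N} = a + b_{12.34}\frac{\sum (x_2)}{N} + b_{13.24}\frac{\sum (x_3)}{N} + b_{14.23}\frac{\sum (x_4)}{N}$$

$$\text{(II)} \qquad \frac{\sum (x_1x_2)}{N} = \frac{a\sum (x_2)}{N} + b_{12.34}\frac{\sum (x_2^2)}{N} + b_{13.24}\frac{\sum (x_2x_3)}{N} + b_{14.23}\frac{\sum (x_2x_4)}{N}$$

$$\text{(III)} \qquad \frac{\sum (x_1x_3)}{N} = \frac{a\sum (x_3)}{N} + b_{12.34}\frac{\sum (x_2x_3)}{N} + b_{13.24}\frac{\sum (x_3^2)}{N} + b_{14.23}\frac{\sum (x_3x_4)}{N}$$

$$\text{(IV)} \qquad \frac{\sum (x_1x_4)}{N} = \frac{a\sum (x_4)}{N} + b_{12.34}\frac{\sum (x_2x_4)}{N} + b_{13.24}\frac{\sum (x_3x_4)}{N} + b_{14.23}\frac{\sum (x_4^2)}{N}$$

where $x_1$, $x_2$, $x_3$, $x_4$ represent deviations from the respective
means, $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$ and $\bar{X}_4$, but since the sum of the deviations
about the arithmetic mean is zero

$$\frac{\sum (x_1)}{N} = \frac{\sum (x_2)}{N} = \frac{\sum (x_3)}{N} = \frac{\sum (x_4)}{N} = 0$$

and¹

$$\sigma_2^2 = \frac{\sum (x_2^2)}{N}, \qquad \sigma_3^2 = \frac{\sum (x_3^2)}{N}, \qquad \sigma_4^2 = \frac{\sum (x_4^2)}{N}$$

while

$$\frac{\sum (x_1x_2)}{N} = p_{12} \text{ (the product moment)}^2$$

$$\frac{\sum (x_1x_3)}{N} = p_{13} \text{ etc.}$$

where the value of the product moment may be computed from

$$p_{12} = \frac{\sum (X_1X_2)}{N} - \frac{\sum (X_1)}{N} \cdot \frac{\sum (X_2)}{N}$$

The "normal" equations will now read

$$p_{12} = b_{12.34}\,\sigma_2^2 + b_{13.24}\,p_{23} + b_{14.23}\,p_{24}$$

$$p_{13} = b_{12.34}\,p_{23} + b_{13.24}\,\sigma_3^2 + b_{14.23}\,p_{34}$$

$$p_{14} = b_{12.34}\,p_{24} + b_{13.24}\,p_{34} + b_{14.23}\,\sigma_4^2$$

¹ See page 36.

² See page 81.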
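
The reduced "normal" equations may be solved for the three coefficients of net regression once the product moments are known. The following Python sketch is illustrative only: the data are invented so that $X_1 = 1 + 2X_2 - X_3 + 0.5X_4$ exactly, all function names are hypothetical, and the small elimination routine stands in for whatever method of solution the computer prefers.

```python
from statistics import fmean

def product_moment(u, v):
    """p = sum(UV)/N - (sum(U)/N)(sum(V)/N); with u == v this is the variance."""
    n = len(u)
    return sum(a * b for a, b in zip(u, v)) / n - fmean(u) * fmean(v)

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small linear system."""
    size = len(rhs)
    rows = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, size):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, size + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, size))
        solution[r] = (rows[r][size] - tail) / rows[r][r]
    return solution

def net_regression(x1, x2, x3, x4):
    """Return a and [b12.34, b13.24, b14.23] from the reduced normal equations."""
    predictors = [x2, x3, x4]
    # Variances on the diagonal, product moments elsewhere.
    matrix = [[product_moment(u, v) for v in predictors] for u in predictors]
    rhs = [product_moment(x1, u) for u in predictors]
    b = solve(matrix, rhs)
    a = fmean(x1) - sum(bi * fmean(u) for bi, u in zip(b, predictors))
    return a, b

# Hypothetical data built from a known relation.
x2 = [1, 2, 3, 4, 5, 6]
x3 = [2, 1, 4, 3, 6, 5]
x4 = [1, 0, 0, 1, 2, 0]
x1 = [1 + 2 * u - v + 0.5 * w for u, v, w in zip(x2, x3, x4)]

a, b = net_regression(x1, x2, x3, x4)
print("a =", round(a, 6), " b =", [round(value, 6) for value in b])
```

The constant a is recovered from equation (I), since the fitted surface passes through the point of averages.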
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TECHNICAL APPENDIX VIII 

STANDARD ERROR OF ESTIMATE,
MULTIPLE CORRELATION

The formula for the standard error for multiple correlation
may be derived in the same fashion as the "least squares" formula
for simple correlation

$$S_{1.234}^2 = \frac{\sum (X_1^2) - a\sum (X_1) - b_{12.34}\sum (X_1X_2) - b_{13.24}\sum (X_1X_3) - b_{14.23}\sum (X_1X_4)}{N}$$

reducing this to deviations from the respective means this will read

$$S_{1.234}^2 = \frac{\sum (x_1^2)}{N} - b_{12.34}\frac{\sum (x_1x_2)}{N} - b_{13.24}\frac{\sum (x_1x_3)}{N} - b_{14.23}\frac{\sum (x_1x_4)}{N}$$

or

$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}$$


TECHNICAL APPENDIX IX

DERIVATION OF STANDARD ERROR OF
THE ARITHMETIC MEAN¹

N random samples of n items each are drawn and the individual
values expressed as deviations from the true arithmetic mean of
the universe. This may be written as follows:

Item        Sample      Sample      Sample      Sample
Number        #1          #2          #3          #N

  1          x'_1        x'_2        x'_3        x'_N
  2          x''_1       x''_2       x''_3       x''_N
  3          x'''_1      x'''_2      x'''_3      x'''_N
  n          x^n_1       x^n_2       x^n_3       x^n_N

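If items #1 and #2 of each sample are added

$$x_{1+2} = x' + x''$$

but the standard deviation is equal to

$$\sigma = \sqrt{\frac{\sum (x^2)}{N}}$$

or in this instance

$$\sigma_{1+2} = \sqrt{\frac{\sum (x_{1+2}^2)}{N}}$$

or

$$\sigma_{1+2}^2 = \frac{\sum (x_{1+2}^2)}{N}$$

but

$$x_{1+2}^2 = (x' + x'')^2$$

$$= (x')^2 + 2(x'x'') + (x'')^2$$

and

$$\sum (x_{1+2}^2) = \sum (x'^2) + 2\sum (x'x'') + \sum (x''^2)$$

Since the successive items of the sample are drawn at random
they are uncorrelated (r = 0) and therefore $\sum (x'x'') = 0$.

$$\therefore \sum (x_{1+2}^2) = \sum (x'^2) + \sum (x''^2)$$

or after dividing by N

$$\sigma_{1+2}^2 = \sigma_{x'}^2 + \sigma_{x''}^2$$

But as the number (N) of the samples is increased $\sigma_{x'}$ will tend
to approach the standard deviation of the universe from which the
samples were drawn, as will in a similar manner $\sigma_{x''}$

$$\text{or } \sigma_{x'} = \sigma_{x''} = \sigma_x$$

and when N is very large

$$\sigma_{1+2}^2 = 2\sigma_x^2$$

For the sum of the first three items

$$\sigma_{1+2+3}^2 = 3\sigma_x^2 \text{ when N is large}$$

And for the sum of n items

$$\sigma_{1+2+\cdots+n}^2 = \sigma_{x'}^2 + \sigma_{x''}^2 + \cdots + \sigma_{x^n}^2$$

$$= n\sigma_x^2 \text{ when N is large}$$

Dividing all items and totals by n gives

$$\frac{x_{1+2}}{n} = \frac{x'}{n} + \frac{x''}{n}$$

and

$$\sigma_{\frac{1+2}{n}}^2 = \sigma_{\frac{x'}{n}}^2 + \sigma_{\frac{x''}{n}}^2$$

and since $\sigma_{x'}$ tends to equal $\sigma_x$

$$\sigma_{\frac{x'}{n}}^2 = \frac{\sigma_x^2}{n^2}$$

and for x' and x''

$$\sigma_{\frac{1+2}{n}}^2 = \frac{2\sigma_x^2}{n^2} \text{ when N is large}$$

or for the sum of n items

$$\sigma_{\frac{1+2+\cdots+n}{n}}^2 = \frac{n\sigma_x^2}{n^2} = \frac{\sigma_x^2}{n}$$

but since

$$\frac{\sum x}{n} \text{ for each sample} = \bar{X}$$

$$\sigma_{\bar{X}} = \frac{\sigma_x}{\sqrt{n}}$$

where

$\sigma_x$ is the standard deviation of the universe and not of the
sample. However, lacking this value, the standard deviation
of the sample ($\sigma$) is used as an estimate of this value.

¹ This derivation follows Ezekiel's in Methods of Correlation Analysis.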
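
The result may be illustrated by a sampling experiment. In the Python sketch below a normal universe with mean zero is assumed for convenience, the number of samples, the sample size and the seed are arbitrary choices, and the function name is hypothetical. The standard deviation of the sample means should lie close to the standard deviation of the universe divided by the square root of n.

```python
import math
import random
from statistics import fmean, pstdev

def sd_of_sample_means(num_samples, n, sigma, seed=0):
    """Draw num_samples random samples of n items each from a normal
    universe with standard deviation sigma, and return the standard
    deviation of the resulting sample means."""
    rng = random.Random(seed)
    means = [fmean(rng.gauss(0.0, sigma) for _ in range(n))
             for _ in range(num_samples)]
    return pstdev(means)

observed = sd_of_sample_means(num_samples=4000, n=25, sigma=2.0)
expected = 2.0 / math.sqrt(25)      # sigma of the universe over root n
print(f"observed {observed:.4f}, expected about {expected:.4f}")
```

The agreement improves as the number of samples N is increased, which is the condition "when N is large" in the derivation.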
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